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Abstract 

Up-and-Down and the Percentile-Finding Problem 

Assaf Pcrctz Oron 

Chair of the Supervisory Committee: 
Associate Professor Peter D. Hoff 
Statistics 

Up-and-Down (U&D), a popular sequential design for estimating threshold percentiles in binary 
experiments, has not been a major subject of statistical research since the 1960's. U&D application 
practices have stagnated, and significant gaps in understanding its properties persist. The first 
part of my work aims to fill critical gaps in U&D theory. All U&D variants studied generate 
Markov chains of treatments. New results concerning stationary distribution properties are proven. 
A commonly used U&D variant known as "fixed forced-choice staircase" or "k-in-a-row" (KR), 
is proven to have a single-mode stationary distribution contradicting recent remarks about it in 
literature. Moreover, KR is shown to be the best-performing design in terms of convergence rate and 
estimation precision, compared with other designs targeting the same percentiles. A second focus of 
this study is nonparametric U&D estimation. An improvement to isotonic regression called "centered 
isotonic regression" (CIR), and a new averaging estimator called "auto-detect" are introduced and 
their properties studied. Interval estimation solutions are also developed for both estimators. 

Bayesian percentile-finding designs, most notably the continual reassessment method (CRM) 
developed for Phase I clinical trials, are also studied. CRM is believed to converge to exclusive 
allocation to a single optimal level, but to date, this belief had not been substantiated by extensive 
proofs. Here I add several proofs, expanding the understanding of when CRM might converge. In 
general, CRM convergence depends upon random run-time conditions - meaning that convergence 
is not always assured. Small-sample behavior is studied as well. It is shown that CRM is quite 
sensitive to outlier sub-sequences of thresholds, resulting in highly variable small-sample behavior 
between runs under identical conditions. Nonparametric CRM variants exhibit a similar sensitivity. 

Ideas to combine the advantages of U&D and Bayesian designs are examined. A new approach is 
developed, using a hybrid framework, that evaluates the evidence for overriding the U&D allocation 
with a Bayesian one. This new design, called "Bayesian Up-and-Down" (BUD), is shown numeri- 
cally to be more robust than CRM, and moderately outperforms either CRM or U&D in terms of 
estimation. 
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1.1 Preface 

In many binary-response experiments, researchers are not interested in finding the entire 
treatment-response curve, but only a single representative value, which can be described via 
a percentile of the underlying threshold distribution F. This percentile can represent (for 
example) the sensory threshold, the LD^q of a toxic substance, or the maximum-tolerated- 
dose (MTD) of a medication. Such experiments are typically limited to a sample of several 
dozen trials or even less, and with little prior knowledge about the form of F. An additional, 
rather common constraint is that treatment levels are limited to a fixed, discrete, finitely- 
spaced set. 

Statisticians have been developing designs to answer this type of challenge for at least two 
generations. Yet, interestingly, if an applied researcher were to seek statistical consulting 
on this problem, he or she would receive wildly differing recommendations, depending upon 
which field they work in or which statistician happened to be the consultant. Moreover, an 
applied statistician wishing to consult a client on this issue would discover, upon a cursory 
literature search, that any clear, well-tested and practical guidelines (if they exist) are 
drowning in a sea of contradictory "best practices" recommendations, antiquated references, 
unverified claims and controversy. In short: until quite recently, there was no up- 
to-date, standard reference for this problem]^ 

These are not hypothetical statements; this is exactly what happened to me in late 2003, 
when I was enlisted to help Dr. M. J. Souter design an anesthesiology percentile-finding 



experiment (jOronl . 12004 : iBhananker et al.l . 120071 ) . My search for solid and reliable solutions 



has evolved into the dissertation you are reading. After this tortuous journey, I find it 
important to repeat in this introduction a well-known truism, which is somehow forgotten 
in this field's prevalent patterns of debate. Here it is, in the form of a disclaimer: 



Constrained by the binary response, a small sample size, an unknown response 
threshold distribution and a discrete treatment set, the quality of information 
the experiment can provide is limited. Failed experiments (however one 
chooses to define "failure") are far from unlikely. The degree of confidence 
expressed by many statisticians promoting this or that method is exagg eratedi 

Keeping this perspective in mind, there is quite enough room for theoretical insight and 



Apparentl y, I am not the only p erson to have noticed this. IChevretl (|2006l ') is a recent attempt to fill 
the gap. P ace and Stvlianoul ()2007h 's review is an attempt to update the anesthesiology community with 
recent statistical developments relevant to their needs. 

• ^Fisheij l|2007l ). in a recent Anesthesiology editorial, voiced similar concerns regarding the use of 'Up-and- 
Down' methods. 
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methodological developments. The next section will expand and clarify these statements. 

1.2 Introduction 

1.2.1 Conceptual Framework 

As the opening paragraph explains, we are looking for a treatment value x that will gen- 
erate positive response with probability p. We assume that this probability is monotone 
increasing in x. Furthermore, we assume that the response is deterministically triggered 
via an underlying population (sub)distribution of response thresholds -F(t), with a density 
We aim to estimate Qp = F^^{p), known as the target percentile, or simply the 
target. 

An excellent, perhaps ideal experiment, would be to directly sample the thresholds them- 
selves, {Tn} ~ /. If this is possible, then without knowledge of F a good target estimate 
would be T(„p), the sample np-th order statistic (or lOOp-th percentile). For large n, T^^^) 
is approximately normal with asymptotic variance 



This can be perceived as a rough precision benchmark for the percentile-finding problem. 

Our conditions, however, are far from ideal. Rather than the thresholds themselves, we 
observe binary responses that can be interpreted as censored current-status informa- 
tion: at a given trial indexed i with treatment Xj, we know that tj < Xj if the response 
was 'yes' and ti > Xi if it was 'no'. If we pool all our treatments to the same value xq, we 
can estimate F(xq) with variance F(xq) [1 — F(xq)] jn. This is not a very impressive feat; 
for example, unless F{xq) is close to or 1, we need n 100 to get a 95% CI of ±0.1 for 
F{xq). And the best we can achieve this way, is knowing whether Qp > xq, Qp < xq, or 
(if we are lucky) Qp ~ xq. It is no wonder that in the related problem of binary-response 
opinion polls (which are analogous to estimating F at a single specified xq), the minimal 
acceptable sample size is several hundred. 

If we try to cover a larger range of treatments by splitting the sample equally between m 
levels {h . . . Im), we will obtain m even less precise point estimates, with many observations 
taken quite far from Qp - indicating a waste of resources. The MLE with no knowledge of 
F's shape is not much more than an interpolation between these rather blunt point estimates 

^This last assumption is often wrong, especially in psychophysical experiments to find the sensory thresh- 
old of a single subject. In that field, it is common to apply fixes such as added parameters for the false 
positive and false negative rates. However, the simplifying assumption allows us to treat x and t as one 
and the same, and focus on the core features of the problem. 



yar [T(^np)] 



P(l -P) 
nPiQp)' 
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(this is examined more closely in Chapter [3]). Taking the range-covering approach further, 
if we could let x vary randomly and continuously, the nonparametric ML E of F for current 



status data converges - both globally a nd locally - only at an v}/^ rate ( Kim and Pollard 



1990; 



Groenenboom and Wellner 



1992 ). 



In short, the problem's properties imply that our prospects for solidly successful target 
estimation are not very promising. 

A way out of this quagmire would be some "intelligent" sequential scheme, that would 
lead us to collect most of our information in the vicinity of Qp ~ even though before our first 
trial we have little knowledge where that target might be - and then allow for reasonably 
useful estimationll] Again, as hinted previously, statisticians and others have suggested 
many such schemes, but the most widely used methods fall almost exclusively into one 
of two groups. These are Markovian or Up-and-Down (U&D) designs, and Bayesian 
schemes. 

I shall revisit this generic sampling perspective occasionally throughout the dissertationlfl 
Typically, these will also be the only places where the response threshold distribution might 
be referred to as F[t). Otherwise, because of the threshold-triggering assumption spelled 
out above, I will refer to the threshold CDF as F{x): an (indirect) function of treatments. 



1.2.2 Some Definitions and Constraints 

Before we proceed further, it may be a good idea to define the two design approaches. The 
following definition is bro a d enough to in clud e all methods discussed in the thesis, and is in 
the spirit of lstoreil (|200lh : IPotteil (12002) and llvanova and Flournovl (120061). 



Definition 1 (i) An 'Up-and-Down' (U&D) or algorithm-based design is any percentile- 
finding, binary-response sequential experimental design with treatments and responses {x^}, {vn], 
respectively; with fixed, discrete treatment levels {Im} and sequential treatment allocation 
rules, which are functions o/{x„},{y„} and possibly also (p, a set of fixed parameters. 

(a) A Bayesian or model-based scheme is as in (i) above, but the allocation rules 
are functions of {xn},{yn}-,4>! o,nd a statistical model G {{xn^^iVn} \ 0,^), where 9 is a set 
of data- estimable parameters. 

As Definition [1] indicates, the dissertation will only deal with designs constrained to a 
fixed discrete set {Im}- This excludes methods such as stochastic approximation and its 

''observing (ILlf) . the concentration of observations around target is tantamount to artificially increasing 
f{Qp) - with a linear effect on estimation error. 

^See Sections 12. 2. 113. 2. 114. 2. 51 14. 3. II and 5.3.1. To complete this bird's-eye view of the dissertation, I also 
suggest Section fS.GI - which summarizes my U&D-related findings in the form of practical recommendations 
- and the conclusions. 
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descendants ( Robbins and Monro! . 



1951 



Lai l2003l ). Additionally, we will usually assume 



that {Im} is evenly spaced (whether on the original or on a transformed scale), with level 
spacing Other simplifying assumptions are that the underlying threshold CDF F is 
continuous strictly increasing, that it has a density /, and that thresholds in the experiment's 
trials are i.i.d. draws from /. 

These assumptions will save a great deal of burdensome notation, and should help us 
focus on essential design and estimation properties rather than on terminology0 



1.2.3 Audience and Scope 

Percentile-finding is an application encountered in many research fields. However, two fields 
have dominated the design debate. In one of them (psychophysics or sensory studies), it 
is because the field's most routine experiment - determination of sensory thresholds - is 
modeled as a percentile-finding problem. Proper treatment of the sheer wealth and breadth 
of percentile-finding designs used in psychophysics may take several books. Fortunately, the 
constraints outlined above already filter out a large part of these designs. 

The second field (Phase I clinical trials) is dominant not so much because of its preva- 
lence, or because of the wealth of designs. In fact, the vast majority of Phase I trials still 
appear to use conservative protocols, which most statisticians (myself included) view as out- 
dated. The 'Phase I' application has dominated debate mostly because of the large amount 
of resources and statistical attention commanded by clinical trials in general. 

The typical practical constraints in these two fields are very different. In psychophysics, 
most experimental runs are carried out on a single subject. Sample size and trial cost are 
usually not a problem, and in principle the treatment can vary continuously (though most 
designs prefer to stick to a discrete set). The sensory threshold itself is also defined quite 
flexibly: anything from the median to the 80th percentile may be commonly encountered. 
By contrast. Phase I trials are limited to small sample size (< 40), and are usually carried 
out in cohorts of 3 patients. The target is typically between the 20th and 33rd percentile, 
and due to toxicity concerns it is preferably approached from below. 

In between, the "vanilla" engineering and scientific applications are usually median- 
targeting, size-constrained, and for lack of statistical attention have continued to plough 
along with 1940's to 1960's methodology. The three application types are summarized in 
Table 

®In some cases we will assume {Im} is finite and in others not, depending upon context. Usually the 
difference between the two cases is more notational than substantial. 

^Most results discussed here can be easily extended to cases when one suspects F to behave otherwise, 
or wishes to run the experiment on an unevenly spaced {Im}- 
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Table 1.1: Side- by-side comparison of the properties of three most prominent types of 
percentile-finding apphcations mentioned in the text. 



Application 


Psychophysics 


Phase I 


General Engi- 








neering 


Target 


Median or above 


20th to 33rd Per- 


Median 






centile 




Cohort Size 


1 


2-6 


1 


Major Constraints 


Simplicity, re- 


Toxicity, medical 


Simplicity, cost, du- 




sponse fa- 


ethics, volunteers, 


ration, toxicity (in 




tigue /reliability 


duration 


some cases) 


Currently Popular 


A large variety, in- 


'3-1-3' protocol. 


U&D 


Designs 


cluding U&D and 


with Bayesian de- 






Bayesian designs 


signs increasingly 








popular 





My approach has been to steer clear of complete commitment to a single field's specific 
needs. The rationale is that this is precisely a statistical researcher's role: to look at the 
generic problem, examine and formulate solutions to it - solutions useful for all applications. 
From these generic solutions, it should not be difficult to tailor specific applications. All 
that being said, given the medical motivation that introduced me to the problem, the terms 
and concerns encountered in medical literature are more prevalent in this thesis than those 
of other fields. 



1.2.4 Thesis Layout 



The thesis focu ses on the most ve t eran family of U&D designs, which includes the original 
1940's method (jPixon and Moodl . Il948l l. Chapter [2] tackles the designs themselves, while 
Chapter [3] is devoted to estimation. In rece nt decades, Bayesian model-based designs known 



under acronyms such as QUEST and CRM (jWatson and Pellil . 119831 : iQ'Quiglev et al.l . ll99d ) 
have quickly eclipsed U&D in the number of statisticians studying them (though not in 
their prevalence of use). These designs are discussed in Chapter HI There have been some 
attempts to combine the two approaches, and I have devoted quite a bit of energy to 
this as well. Combined approaches are discussed in Chapter [5l which is followed by the 
conclusions. To minimize confusion between the thesis terminology and that of 
other publications, I will refer to the algorithm-based design family simply as 
U&D, and to the model-based family as CRM or "Bayesian". 
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Rather than precede the thesis body with a long and tedious introduction to ah these 
subjects, I have opted to start each chapter with an introductory section, which reviews the 
current state of knowledge on that chapter's topic. Such sections are followed by sections 
with theoretical results, and then sections with new methodological developments. In Chap- 
ter [3l the dissertation's largest chapter, there is also a separate section presenting numerical 
results Ifl, a section reporting and analyzing the results of the motivatin g anesthesiology 
experiment, which has been recently completed (jBhananker et al.l . 120071 ) - and finally, a 
section summarizing Chapters 2-3 with a list of practical recommendations. At the end of 
Chapters 2, 3 and 4 the reader can find glossaries for these chapters. 



1.2.5 The Role of Simulation 

In fall 2003, while still on the anesthesiology consulting job, team leader Nancy Temkin 
suggested that I use computer simulation to see how different U&D designs pan out. I 
proceeded to do so, unwittingly following in the footsteps of Wetherill, who in the 1960's 
traveled across t he ocean to make use of compu t ing r esources not available at the time 



in Great Britain (jWetherill 



1963 



Wetherill et al 



19661 ). This was such a novelty back in 



England, that the mere use of simulati ons became a m ajor gossip topic during the customary 
Royal Society discussion appearing in lWetherilll (|l963l ). Ever since then, for better or worse, 
simulation has played a central role in evaluating percentile -finding designs and estimat ors. 
Sometimes this approach is taken to the extreme: recently, lO'Quiglev and Zohaii (|2006l ) in 
their mini-review of CRM claimed that numerical simulation should be the main approach 
in studying and evaluating designs for the small-sample percentile-finding problem. 
I could not disagree more. 

At their best, simulations can perform a role analogous to that of lab experiments - 
proving or disproving a theory, demonstrating a theoretical result, and revealing novel pat- 
terns that inspire further theoretical research. Much of the painstaking work carried out 
by Garcfa-Perez and colleagues ove r a decade, examining percentile-finding designs in psy- 



chophysics, belongs to this categorv (iGarcia-Perej . 



2004 



Garcia-Perez and Alcala-Quintanal . l2005l ). 



1995 



Alcala-Quintana and Garcia-Perezj . 



Not at their best, simulations can be a lazy man's escape from doing theory, producing 
individual trees that are then mistaken for a forest. A classic example is Wetherill's "re- 
versal" estimator (see Cha pter pi) , chosen in a large part because of favorable simulation 

19661 ). As Section [3.21 will show, in general this estimator is in 



outcomes ( Wetherill et al. 



fact inferior. 



'in other chapters, numerical results are embedded within the sections. 
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At their worst, simulations can be abused to further irrelevant agendas. Here the classic 
example is the rise of Bayesian designs for Phase I trials. Researcher s have used simulations 
to "prove" this approach 's advantage over others, most notably U&D (jO'Quiglev and Chevretl . 



1991 



Babb et al. 



19981 ). Immediately upon presenting the first such simul ation study, the 



authors openly called upon regulatory authorities to discontinue use of U&D (|Q'Quiglev and Chevretl . 
199l|). While this call was not heeded, such simulations and the confidence with which they 



were presented have doubtlessly played a role in shutting most of the statistical world off to 
U&D. But in both cases the comparisons have been "ri gged" in several ways against UfcD 
(see s ection 14.11 for details) . More neutral comparisons (|Garcia-Perez and Alcala-Quintanal . 



20051 ) have yielded an almost diametrically opposite conclusion. I hope this thesis would 



help explain why. 

During my work on this problem, I have spent thousands of human and CPU hours 
on simulation. I tried to keep simulations "at their best" as defined above. Many times I 
have found myself to be lazy, discovering that 2 months of simulations have saved me... 10 
minutes of theory. I have been quite wary, and hopefully have succeeded, in keeping myself 
from abusing simulations, as happened in the cases mentioned above. 

As the thesis materialized, I have tended to base more and more of my results on theory 
and not on simulation. This has been a mostly successful endeavor. Simulation in this thesis 
is confined mainly to illustrating the pros and cons of different designs and point estimators 
under different scenarios, or for such essential technical tasks as evaluating the coverage of 
interval estimators. There are one or two exceptions, where the numerical pattern strongly 
suggests a theoretical result, but my quest for that result has been incomplete (U&D con- 
vergence comparison in Section [2. 2 1 smoothed isotonic regression in Chapter [3]). Here I have 
to cling to the well-worn, job-security cliche: hopefully future work will find that proof. 
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Chapter 2 

THE UP-AND-DOWN DESIGN AND SOME VARIANTS 
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2.1 Description and Current Knowledge 

2.1.1 Basic Terminology 

Let X be some experimental treatment with a binary response Y{X) ~ Bernoulli {F{X)). 
We assume F is continuous and strictly increasing. Researchers look for Qp = F^^(p), 
the lOOp-th percentile (or p-th quantile) of F. Treatments are administered sequentially: 
both treatments and responses are indexed Xi,yi, i = l...n. Following trial i, the next 
treatment Xj+i is determined by some subset of all previous treatments and responses 
{xi . . . Xi,yi . . . yi}, and by the specific method's transition rules. In U&D methods, exper- 
imental treatments are restricted to a fixed set of levels lu,u = 1, . . . m, with level spacing 
s0 The subsequent discussion focuses on targets at or below the median, but all results can 
be trivially translated to targets above the median. 



2.1.2 History 



The up-and-down (Ufcp) sequential me t hod w as developed in the 1940's (jPixon and Mood 



19481 : Ivon Bekesvl . 119471 : 



Anderson et al, 



19461 ) . It is designed to estimate the median thresh- 



old, under constraints of moderate sample size and a discrete set of treatment levels. It is 
still in use tod ay in its origina l appli cations - finding the median sensitivit y of explos i ves to 



1995 



shock loading (IChao and Fuhl . |2003| ) and estimating sensory thresholds (jTreutwein 
Garcia-Perea . 1 19981 ) . It can also be found in a wide arr ay of applied research fields - includ- 
ing fatigue testing in metallur gy and materia. 1 scien ce (jLagoda and Sonsind . |2004| ). testing 



of dental restorative materials 



Scherrer et al. 



20031 ) , breakdown voltage e stimation in elec 



trical engineering (|Komori and Hirosd. l2000l). animal toxicity respo nse (jSunderam et al 



NIEHS 



2001 



20041 ) , and anesthesiology (jCapogna et al.l . l200ll : iDrover et al 
cations, UfcD is considered a standard method (|jSMeI . Il98l 



200411. In numerous appli- 



ASTM 



1991 



OECD . 



1998 



). Note on terminology: the term "up-and-down" used to refer only to the 
median-finding design, but hereafter I shall use it in its more recent interpretation - any 
percentile-finding design that can be seen as an extension of the original method. That 
original method will be called 'simple up-and-down' ( SUfcD). 



Methodolog i cal research during UfcD's early years (IBrownlee et al.Ul953l : iDermanl . 



Wetherill . 


1963: 


Dixon. 


1965: 



Wetherill et al 



1966 



1957 



Tsutakawal . Il967b ) generated accepted 



practices for design and estimation. The first proper identifi cation of the U&D treatment 



sequence as a Markov chain is credited to 



Tsutakawa 



(1967; 



Paradoxically, just as this 



discovery was made, and as U&D was proliferating into more and more fields, statistical 



^However, since F is continuous, we shall sometimes refer to a; as a continuous variable when studying 
properties of F. 
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interest in the method had begun to dwindle. From the mid-1970's until the late 1980's, 
very few U&D theoretical or methodological studies were published. This gap has probably 
contributed to the rather uneven pattern of usage observed today. Different fields adhere 
to different practices, whose rationale has often been lost in the mists of time. 



Sample size is a case in point. iDixon and MoodI (119481 ) originally recommended a sample 
size of n ~ 40 for median estimation. Shortly afterw ards, there was a trend to ta. i lor th e 
method for smaller and smaller sample sizes, n < 10 (jBrownlee et al.l . 119531 : iDixonl . Il965l ). 
Nowadays, some military industries still use n ~ 40 for median-sensitivity estimation (H. 
Dror, personal communication, 2006). Some groups in anesthesiology use a fixed sample 



size of 30 for ED5O es t imati on (jCapogna et al 



2001 



Camorcia et al 



20041 ). In the same 



field, iPaul and Fisheii (|200ll ) report that many anesthesiologists use the fourth reversal as 
a stopping rule - roughly equivalent to a dozen treatments (reversals will be defined later 
in this section). Meanwhile, many researchers in anesthesiology and elsewher e use n < 10 



for the same goal (jLichtmanl . Il998l : 



Sunderam et al. 



20041 : 



Drover et al 



200J). 



Since around 1990, methodological interest in U&D has been gradually picking up again. 
Practitioners in fields using U&D have expressed increased interest and skepticism about the 



suggestions ( 


Garcia-Perez 




1998 




Zhane and Kececioelu 


, ] 


L998: 


Braam and van der Zwaae. 


1998: 


Lin et al.. 


2OO1I: 


Paul and Fisher. 


2001 


: IPollak et al. 


. 20061"). Several statistical teams 



have entered the fray as well, mostly in th e context of Phase I clinical trials, which target 
the 20th to 33rd percentiles (jStoreii . Il989l ). This rece nt research has yielded results that 



Durham et al 



1995 



Gezmu 



Bortot and Giovagno. 



200c 



Chao and Fuh 



200 



2001 



I 



1996 



Giovagnoli and Pintacudal . 119981 : 



Gezmu and Flournov 



Stylianou and Flournov 



this progress is summarized bv llvanova and Flournov 



incorporate adyance s in the s tudy of Markov chains (iDurhain and Flournov!. 11994 , 



Ivanova et al 



2006 ) and in estimation ( Komori and Hirose 



2002; 



Stvhanou et al.l. I2OO3I'). M uch of 



1995; 



2003 



(j2006l ) and lPace and Stvlianou 



All the same, some of the U&D's basic properties are still seldom discussed. 

But perhaps before everything else, the simple question of which U&D variant to choose 
for a given application, especially for non-median targets, has not been properly addressed. 
Since the 195 0's researchers have cr e atively p ropos e d many dozens of non - median UfcD 
varian t s (e.g.. IWetherill et al. I Jl966ll: Istoreil Jl989 ll: 



(119981 ^: 



Ivanova et al 



Durham et al 



(|l995l ): 



Garcia-Perez 



Bortot and Giovagnolil (|2005l )). Yet, only a handful of articles 



actually compare different U&D methods' performance, and usually only via limited-scope 
simulations □ There are also many innovative non-U&D sequential designs to achieve the 



^The only exception that comes to mind is (|Bortot and Giovagnolil , boOSi ) . who prove that BCD is optimal 
among first-order Markovian methods. But applied researchers do not care about a method's Markov- 
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same goal (jO'Quigley et all 119901 : iRosenberger and Grill 119971 ) , and one may legitimately 



question whether U&D is the best platform. But a comparison of U&D with other ap- 
proaches means little, if the U&D design and estimation methods used are clearly inferior 
to other available U&D methods, of which the researchers were not aware. 

Another interesting aspect of the historical gap in U&D research is the method's per- 
ceived nature. U&D was originally tied to a parametric assumption: th e thresholds were 
assumed to be normally distributed ([Dixon and Moodl . ll948l : lDixonl . ll965l ). Nowadays U&D 
is almost universally seen as a nonparametric design. There has been little theoretical explo- 
ration into the ramifications of this change in perception, regarding design and estimation 
as actually practiced in the field. 



2.1.3 Transition Probabilities, Balance Equations and Stationary Formulae 



The original, median-targeting simple up-and-down (SU&D) starts at an arbitrary level, and 
then moves up or down 1 level, following 'no 'or 'yes' responses, respectively. This makes 
SU&D a lattice random walk, with 'up' and 'down' transition probabilities = 1 — F{lu) 
and q„, = F(l^,), respectively, j Assuming Fjl^,) £ (0, 1) Vn, there is a stationary distribution 



(jTsutakawal . Il967bl : [Durham and Flournovl . Il995l ). obeying: 



TTu+igu+i, 



lu + l 



u = 1 . . .m — 1 

l-Fjlu) ^ 1-F{x) 
F{x+s) 



(2.1) 



with straightforward normalization to determine tti. The symbol ■ju, hereafter termed "the 
stationary distribution profile", is monotone decreasing in u. Therefore, the stationary 
distribution has a single mod e, including the possibility that the mode is on a boundary 
^Durham and Flournov . 1995 ). T he UfcD treatm ent sequen ce is known as a randoni walk 
with a central tendency (see e.g., iHughesl . Il995l . Ch. 3.4). iDurham and Flournovl ()l994l ) 
prove that the stationary mode is at most one spacing interval away from the median. 

This is the place to note that U&D's basic properties, such as a single mode near target, 
are retained even under an infinite (unbounded) set of treatment levels. Since the profiles 
{7m} are monotone decreasing, with reference to the frequency at the mode the stationary 
frequencies form a sequence that decreases in a faster-than-geometric rate to either direction. 
Therefore, the overall sum of frequencies is a convergent series, even if carried to infinity, 
and TT always exists with a mode near target. Hence, distinguishing between a bounded set 



chain order; they would Uke to know which method is best for estimating a given percentile, under certain 
practical constraints. 

^On the boundaries, trivial "reflecting" conditions are imposed; this type of boundary conditions is 
Eissumed throughout the text, but explicit boundary details are omitted for brevity. 
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of treatments indexed 1 . . . m, and an unbounded one indexed u, — oo < n < oo presents 
only a minor technical challenge and not a conceptual difference. Furthermore, as will be 
seen in Chapter [3l in some sense the unbounded-design version of U&D is the "pure" one. 



DermanI (|l957l ) suggested modifying U&D via randomization in order to target non- 



median percentiles. However, t he method now known as 'biased -coin' design (BCD) 
was developed independently bv lDurham and Flournovl (jl994l . ll995l ). It is similar to SU&D, 
except that after a 'no ' response the next treatment is determined using a random draw: 
with probability r/(l — F) we go up one level, otherwise the level is unchanged (F G (0, 0.5]). 
Now Tu, the probability of remaining at the same level for another trial, can be nonzero. 
The resulting transition probability rule as a function of F is 



Pu = [I - F{lu)]j^, u=l...m-l; 
r^ = [l-F{lu)]^, n = 2...m-l; 
qu = F{lu), u = 2...m 



(2.2) 



The stationary distribution obeys (jPurham and Flournovl . 1 19951 ) 

1-F{Q F 



7« 



F{lu 



1 



(2.3) 



Again, the walk has a central tendency and a single mode; it was shown ([Durham and Flournovl . 
1994 ) that the stationary mode is at most one spacing interval away from Qy, which is the 
method's designated target. 

Another non- median UfcD me t hod i s knowri as "f orced-choice fixed staircase" or 'k- 
in-a-row ' (KR) (|wetherin et al.l . Il96d : loezmiJ . ll99fil lFI Here there is no random draw; 
instead, we must observe exactly k consecutive 'no's at a given level before moving up (the 
'down' rule remains as in SU&D). For k = 1, KR reduces to SU&D. Even though KR is 
by far the most widely used non-median U&D method and is arguably easier to administer 
than BCD (requiring no random draw), its Markov chain properties are more complex and 
have rarely been studied. KR's transition probability matr ix and statioii ary distribution 
were described by Gezmu in her unpublished dissertation (|Gezmul . Il996l ). In numerical 



runs, KR ha. s repeatedly exhibited advantages over similar-targeted BCD designs (see also 



Stored . Il989l : 



Ivanova et al. 



20031 ) . However, Gezmu and her collaborators have come to 



doubt the method's theoretical properties, most notably the existence of a single stationary 



^In fairness, I should have used the term 'staircase' to refer to this method, since this is how most of 
its users (in psychophysics) know it. However, 'k-in-a-row' is shorter and more descriptive. Moreover, 
rigorously speaking, all 3 U&D 'flavors' discussed here answer the definition 'forced-choice fixed staircase' 
(as con trasted with ' adaptive staircase', in which the 'up' and 'down' steps have different level spacing; 
see e.g. iGarcia-Perej lfl998) |). 



14 



mode (jivanova and Flournoyl . l2006l ) . This and other properties wih be explored in the next 
section. 



One aspect of KR that has been known since its inception (IWetherill et alJ . Il966l ) is its 
target: it is the solution to the equation [§ 



[l-F(Qp)]'= = l 



p = F(Qp) = l-(l) 



1 - HQpt 



(2.4) 



Unlike BCD, KR can target only a discrete set of percentiles; for k = 2,3,4, these are 
approximately Q0.293, (3o.206 and Qo.isg, respectively^ 

SU&D, BCD (r = 0.206) and KR {k = 3) are illustrated in Fig. O - where all 
three were run numerically on the same set of 40 thresholds. As the figure shows, how 
closely an individual experiment's chain tracks around Qp depends upon F's properties 
(the distribution used for Fig. 12.11 is relatively shallow), and also upon the "luck of the 
draw": for example, between trials 15 and 31, low thresholds were frequently encountered, 
and therefore the SU&D and KR chains remained mostly below target. For the BCD 
chain this was compounded by 'unlucky' randomization draws: between trials 21 and 37, 
treatment escalations took 5, 6 and 5 successive 'no's, respectively, while the conditional 
escalation probability per trial (given a 'no ' response) was in fact a little over 1/4. Note also 
that the n + 1-th treatment is determined by n-th outcome, a property used for averaging 
estimators (see Chapter [3|). 



In group 'up-and-down' (GU&D) (jTsutakawal . Il967bl ) instead of a single trial, at 
each stage a cohort of k simultaneous independent trials with the same treatment is per- 
formed. Probabilistically, GU&D does not use a binary-outcome trial, but rather the bi- 
nomial outcome Yg{x) ~ Binomial {k,F{x)), the number of 'yes ' responses observed in the 
cohort. GU&D transition rules stipulate a move up if Yq < a, a move down if Yq > b, 
and staying at the same level otherwise (obviously, < a < b < k). As k increases, the 
increasing number of possible combinations of a and b provides a large variety of targets 
(there are k{k + l)/2 possible designs with co hort size k). Generic stationa ry formulae and 
other key properties of GU&D are detailed by iGezmu and Flournovl (|2006l ) . More recently, 



Ivanova et al. 



(|2007l ) prove that GU&D's mode lies at most 1 spacing unit away from target. 
Here we focus mostly on the GU&D subset with a = 0, 6 = 1 (hereafter: GU&D(fc o,i))5 



^Paradoxically. IWetherill et al.l (|l966l ) derived KR's target in an erroneous way: they calculated the target 
of a GU&D method to be presented below, and attributed it to KR. 

^In sensory studies the method is inverted: one 'no 'response triggers an 'up' move, while k consecutive 
'j/es ' responses are required for a 'down' move. Hence the targets are Q0.707, Q0.794, Qo.84i, etc. 
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K-in-a-Row, k=3 




> 

a> 




Figure 2.1: Sample numerical runs of SU&D (top), BCD (center) and KR (bottom) experiments. 
The latter two were designed to have the same target (F = 0.2063, k = 3). Targets are indicated by 
horizontal dashed lines. 'Yes 'and 'no ' responses are indicated by 'x' and 'o' marks, respectively. The 
design had 10 levels, and thresholds were drawn from a WeibuU distribution with shape parameter 
at 2, = 0.7 at the top level and one spacing interval below the bottom level. 
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whose transition probabilities are equal to those of SU&D performed on a transformed CDF 

H{x) = 1- [1- F{x)f (2.5) 

(and with each single 'transformed trial' representing a cohort of k actual trials). GU&D^j^. g^i) 
targets are identical to those of KR with the same k, thus enabhng direct comparison be- 
tween the three U&D variants described here. 

Another interesting GU&D design family is GV&iD (^k,a,k-a)^ which targets the median 
(SUfcD itself can be r e-defi ned as GU&D(i o,i) and therefore belongs to this family too). 
Gezmu and Flournoyi (|2006l ) found numerically that for a fixed cohort size k, convergence 
accelerates with incr easing a, but the stat ionary distri bution's mod e becomes more shallow 
(see a lso co mment i n 



Ivanova et al. 



20071 ) . Finally, as IStoreii (|l989l ) , llvanova and Flournov 



(j2006l ) and llvanoval (|2006l ) note, the commonly used phase I cancer trial '3+3' method can 
be modeled as an initial GU&D(3 q 2) stage, changing into a size 6 cohort design before 
stopping. The convergence of GU&:D(-3 0,2) will be examined in the next section. That 
design's target is Qo.347- 

2.1.4 The Convergence- Stationarity Tradeoff, Reversals and 'Down-shift' Schemes 
In his seminal work 



Wetherill 



(|l963l ) noticed that U&D designs face an inherent tradeoff. 
Choosing a large spacing reduces the expected number of trials until the chain first reaches 
target, and hence speeds convergence. On the other hand, a smaller spacing generates a 
tighter distribution around target, once it is reached. The most obvious solution is to begin 
with a large spacing, then reduce it at some point (most conveniently by a factor of 2)|3 I 
shall refer to these spacing-reduction designs as "down-shift" schemes. The question is, 
what would be a good point to make the shift, if such a point exists. 

This is where the notion of "reversal" comes into play. A reversal is a point in the 
experiment, where the response is different from the previous, i.e. a reversal at i means 
that Hi 7^ Ui-i- For SU&D, reversals are where the treatment chain reverses direction, hence 
the name. For the U&D variants described above, that is not necessarily th e case; however, 
the term "reversal" has been used for them under the same definition. Brownlee et al. 



(|l953l ) were the fi rst to suggest dis carding all data up to the first reversal, using 'common 



sense' arguments. IWetherilll (jl963l l suggested "down-shifting" at the first reversal, an idea 



that has recurred in different forms since then. Interestingly, "down-shift" schemes triggered 
by the first reversal have been prominent in recent attempts to integrate U&D into model- 



'^This can be seen as a c rude attempt to utilize some of the advantages of stochastic approximation 
iRobbins and Monrol . [l95ll ) . under the constraint of discrete levels. It can also be viewed as a simplified 
version of simulated annealing (V. Minin, personal communication, 2007). 
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( Storer . 


1989. 


2001; 


Potter . 


2002) 



used as a stopping rule, with the experime nt length specified in reversals instead of trials 



( Garcia-Pere 



1998 



Paul and Fishe: 



2001 



The benefit of using t he first reversa 



as a down-shift point has no t been verified theoret' 



ically. Numerically, both iGarcia-Perea (119981 ) and this author (jOronl . |2005| ) have concluded 



Wetherill 



(|l963l ) 



that down-shift schemes are not a guarantee for performance impr ovement. More g enerally, 
the theoretical properties of reversals have not been studied since 

2.1.5 Other Up-and-Down Descendants 

Beyond the straightforward modific ations desc r ibed above, U&D has spawned a wide vari- 
ety of related designs over the years. IWetherilll (|l963l ) was the first to lay out a catalogue of 



such designs. In psychophysics, a p opular family of rela t ed designs mandates a different step 



1998 



Garcia-Perez and Alcala-Quintana 



size f or 'up' and 'down' transitions (jGarcia-Perez 
20051 ). If the step size ratio is irrational, th e range of levels becornes con tinuous. Another 
related family is that of Polya urn schemes (jRosenberger and Grilll . 119971 ) . Here the transi- 
tion probabilities vary as a function of prior responses, according to a predefined algorithm. 
That family has been less extensively studied. 

Given the significant gaps in knowledge about the simpler Markov-chain U&D designs, 
these non-Markovian U&D descendants are outside the scope of my dissertation. 
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Reversals, KR 




Trial 



Figure 2.2: Illustration of U&D reversals (marked in circles), using the same numerical runs from 
Fig. 12.11 SU&D reversals (top) can be identified visually as direction changes. For the other two 
designs, reversals occur at the beginning and end of each uninterrupted descent (of course, for the 
mirror-image versions of these designs targeting above-median percentiles, reversals would occur 
around ascents rather than descents). 
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2.2 Theoretical Study 

2.2.1 Conceptual Prelude 

The U&D design family achieves a remarkable feat: it converts censored, binary-response 
(and therefore indirect) threshold sampling - a rather unfriendly turf for percentile estima- 
tion - back into direct sampling from a quantitative distribution on the threshold/treatment 
scale. This distribution is vr, the stationary distribution of treatments. Even better, vr ap- 
pears to be centered near Qp. These properties are even more remarkable, considering that 
the method was developed not from a deep understanding of Markov chains, but rather 
using ad-hoc intuitive innovation. 

The properties of tt suggest a very natural estimator for Qp-. the empirical mean. How 
well such an estimator performs, depends on how quickly we can assume to be sampling 
from vr, and on how well vr is centered around target. These will be some of the topics 
discussed below. But first, for lack of a standard reference, some foundational definitions of 
basic concepts are needed - as well as a careful study of KR's properties. 

Notational comment: from here on, I will use Fu as a shorthand for F{lu). 



2.2.2 Definition of Target and Mode 



The c oncept of U&D target has been somewhat ill-defined in research. Most (e.g. lGezmu and Flournoy . 



20061) are content with a genera - 



intu itive notion, similar to that of the method's origina- 



tors. IGiovagnoli and Pintacudal (jl998l ) attempted a more rigorous approach; however, they 
end up defining target only for a narrow (and somewhat contrived) subset of U&D designs, 
which does not cover KR or GU&D. I have found the following definition useful, broad and 
rigorous, and in the spirit of the original idea: 



Definition 2 Consider an 'Up-and-Down' design with a stationary distribution tt, and with 
(marginal) stationary 'up' transition probabilities p{x)\^ continuous monotone decreasing in 
X, and 'down' probabilities q{x)\^ continuous monotone increasing in x, respectively. The 
design's target Qp is defined as the x value such that 



p(x)\^ = q{x\ 



(2.6) 



Ivanova et al 



(|2007l ). in a concurrent paper, have suggested a similar definition for 
GU&D's target. In words, the target is the x value from which marginal stationary 'up' 
and 'down' probabilities would be equal, had a design level been placed exactly there. 
It is straightforward to see that this definition does recover SU&D's and BCD's targets 
as the median and Qr, respectively. It is also equivalent to the definition appearing in 



20 



Giovagnoli and Pintacudal (119981 ) on the subset discussed there. The reason for including 



the terms "stationary" and "marginal" in the definition will become clear below. 

This is also a good occasion for defining the notion of a single mode. The definition is 
followed by a useful result about the mode's location w.r.t Qp. 



Definition 3 Let vr he an U&D (marginal) stationary distribution. Then vr will be said to 
have a single mode, if the set of indices M = {u : tTu > tTv < u and 7r„ > vr^/ \/v' > u| 
has either a single member or two members with adjacent indices u,u + 1. 



Lemma 1 If a U&D (marginal) stationary distribution's profile 'juix) is continuous and 
strictly monotone decreasing in x, then 

(i) The distribution has a single mode. 

(a) If there exist two adjacent levels lu*,lu*+i, such that Qp £ [lu* ,lu*+i], then the mode 
is either at 1^* or at lu*+i- 



Proof (i) If 7i < 1 or 7m-i > 1, then the mode is obviously on the lower or upper boundary, 
respectively. In the remaining (and more interesting) case, there has to be some u* such 
that 7n*-i > 1 > 7m*, with equality holding for at most one of the relationships. By the 
definitions of 7 and of the mode, the mode is at Z^*, possibly with either lu*-i or lu*+i (but 
not both). Due to monotonicity, this mode is unique. 

(ii) Note that the following inequality always holds for any u (except on the lower 
boundary - a case that can be omitted here w.l.o.g.), under the specified conditions: 

7„_i > — > 7n- (2.7) 

qu 

Therefore, for all x < Qp — s, 7(2;) > 1 and for all x > Qp, j{x) < 1. Recalling (i), this means 
that the mode lu* has to maintain /„* G [Qp — s, Qp + s]. Together with 7's monotonicity, 
this forces the mode to be within s of Qp. □ 



2.2.3 Basic Properties of 'K-in-a-row' 

We now return to the 'k-in-a-row' design (KR). Unhke SU&D, BCD and GU&D, KR does 
not generate a first-order random walk. A KR run can be described as a k-th order random 
walk with two sets of transition rules, only one of which applies at each trial depending on 
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the string of k most recent treatments (iGezmul . 1 19961 ) : 



Xi = lu, but 3j, i — k < j < i s.t. Xj ^ : 



Xi = Xi-i . . 



Xi—k-\-l — ^u 



Pu 


= 












Tu 


= 1 - 


Fu, 


U = 


2. 


. m 




qu 


— TP 

— ^ u^ 




u = 


2. 


. m 




Pu 


= 1 - 


TP 

^ U ; 


u = 


1 . 


. m — 


1 


Tu 


= 0, 




u = 


1 . 


. m — 


1 


qu 


— TP 

— ^ u^ 




u — 


2. 


. m 





However, this description does not explicitly acknowledge the counting of k 'no' responses 
before an up transition. For this and other purposes, it is more useful to view KR as 
generating a Markov chain whose treatments are paired with a sequence of internal 



states {wn}- Each internal 
total of mk states (c/. e.g. 
transition rules become 



state 



Weiss], 



"an t ake one of k possible values r = ... A; — 1, for a 



19941 . Ch.6). Under this internal-state formulation the 



Pu,T 


= 0, 




u 


= 1 . . 


. TO, 


r < fc - 


1 






= 1 - 


Fu, 


u 


= 2.. 


. TO, 


T < k - 


1 
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1 
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f 1 








Xi+1 


^ Xi 


=> Wi- 


n 


= 











(2.; 



Note that the terms Pu,Qu, still pertain to transition between physical treatment levels, 
and not between internal states. A few more notes on the internal-state formulation: 



• The Wi are uniquely determined by the x^; hence, after the experiment is over it 
suffices to record {x^}, and then {wn} can be reconstructed. 

• In the same vein, a subject receiving a given treatment Xj is oblivious to wi, which 
has no practical effect upon trial i itself. 

• For this reason and also for comparison purposes, we ultimately focus upon marginal 
probabilities and distributions, i.e. values summed over all internal states. The 
marginal stationary frequency of level u will be denoted -Ku = X^,- '^u,t- 
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Theorem 1 (i) KR's marginal stationary distribution profile {ju} is given by 

Fu [1 - Fuf 



(2.9) 



(ii) Let the stationary marginal 'up' probability of a KR design be the weighted average 
PuItt = iYjT '^u,tPu,t) 7r„. Then 

Pu\^ = (2-10) 



(Hi) KR's marginal stationary distribution has a single mode, which is at most 1 spacing 
unit away from Qp under the conditions specified in Lemma{l\ (ii). 

Proof (i) Taking the internal-state approach, the balance equations between adjacent treat- 
ment levels may be written as 

Vl"n,fc-l[l - Fu] = TTu+lFu+1. (2.11) 

Now, transitions between internal states of the same treatment level are possible only for 
single upward increments, each with probability 1 — F^- Therefore, to maintain balance at 
stationarity, the internal state frequencies must obey tTu^t+i = [1 — Fu]iTu,t, t = . . . k — 2 
- a diminishing geometric sequence. This enables us to calculate relative internal-state 
frequencies, and specifically the upper state: 



Plugging back into ()2.1ip . we obtain ()2.9p . 

(ii) This result is immediate from (j2.9p . 

(iii) Differentiating Pul^^ w.r.t to F shows that it is monotone decreasing as F increases 
(a more detailed derivation can be found in Appendix A). Since F itself is monotone in- 
creasing in X, and since qu\^ = F^ is monotone increasing in x, 'y{x) is monotone decreasing. 
Therefore, by Lemma [H there is a single stationary mode at most 1 spacing unit away from 
target. □ 



A somewhat different proof for result (i) appears in iGezmul (|l996l ). 
Under stationarity, the frequencies of internal states within each treatment level form 
a decreasing geometric sequence. This means that the base {vu = 0) internal state is the 
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most common. Intuitively, this is due to the fact that each level transition (whether up or 
down) resets the internal state to zero. It may be of interest to look at the distribution of 
these first visits - i.e., the stationary distribution of a KR chain for which each sojourn at 
a treatment level is collapsed to a single trial corresponding to the base internal state. 

Corollary 1 Consider a subset of any KR experimental chain, composed only of base-state 
trials, i.e. {xi : Wi = 0}. The stationary profile of this subset is 

[1 - Fu]'' , , 

7.,o = (2-13) 

That is, a profile identical to that of a GU&D(^j^ Q i^ design with the same k. 

Proof Using the geometric-series formula as in (|2.12p . 

_ T^u+lfi _ TTm+i Fu+l {l - [1 - 

vr,,o ~ vr„ Fu{l-[l-Fu+in' 

Plugging in 7„ leads to most terms canceling out, yielding (j2.13p . □ 

Conceptually, we can explain this surprising identity: after deleting all trials with Wi > 0, 
one is left with an SU&D-like (or GU&:D(fc o,i)"like) chain, in the sense that it never remains 
at the same level. Now, from a given level's base state, the probability of the next transition 
being an 'up' one is [1 — Fy]^, and the move must be down otherwise - yielding exactly the 
transition probabilities of GU&D^;. o,i) (see (|2.5p ). This result will prove rather useful later 
on. 



2.2.4 Location of the Stationary Mode 

We have established that for all U&D variants discussed here, the stationary mode is one 
of the two levels closest to target. It turns out we can do even better. 

Theorem 2 (i) SU&D stationary mode is located at the level whose F value is closest to 
0.5. 

(ii) GU&D(^i^ Q i'j stationary mode is located at the level whose H value (with H{x) defined 
in 12. 5\) ) is closest to 0.5. 

(Hi) GU&Di^j^ i^ j._g^ stationary mode is located at the level whose F value is closest to 

0.5. 

Proof (i) Let Qp G \lu*,lu*+i] w.l.o.g. Then we can write F„* = 0.5 — Api,F„*_|_i = 



24 



0.5 + Ap2, with Api, Ap2 > 0. Now 

^ Pu* ^ 0-5 + Api ^ 

Therefore, ju* > 1 if and only if Api > Ap2, with equahty only if the two are equal. Since 
the mode is at one of these two levels, our proof is complete. □ 

(ii) This part follows immediately from (i) and from the definition of H{x). 

(iii) First, note that for the binomial cohort-outcome r.v. Yq defined in the previous 
section, Pr {Yg < a\F = 0.5 - A) = Pr (Ig > ^ - a\F = 0.5 + A). Moreover, both sides of 
the equation are monotone increasing in A. Now using the same Api, Ap2 notation of part 
(i), 

. _ Pr(yc<a|F = 0.5-A0 
~ Pv{YG>k-a\F = 0.5 + A2y ^ ^ 

Again, ^u* > 1 if and only if Api > Ap2, with equality only if the two are equal. □ 

For median- finding U&D, the stationary mode is the closest to target on the response 
scale. For GU&D^^ o,i); tl^is is true on the transformed scale of H{x). Since F = 1 — (1 — 
is convex in H, on the original response scale the transformed midpoint lies closer to 
Fu* , the lower of the two levels. This means that the "attraction basin" of the upper level 
is larger, and therefore if the target lies midway between levels on the F scale, the mode 
would revert upward to lu*+i- However, this asymmetry is relatively small: to first order in 
Api,Ap2, the two "attraction basins" are equal in size. 

If one wishes to measure closest treatment on the x scale, exact results depend upon the 
form of F{x). However, to first order in s, it is straightforward to show that the treatment 
midpoint is the boundary between the two "attraction basins" for all designs covered by 
Theorem [21 

The mode location of the two remaining non- median designs is less symmetric. 

Theorem 3 (i) For BCD designs, and using the same terminology as in Theorem the 
stationary mode would revert to lu* unless Fu*+i is closer to p by a factor greater than 
(l-p)/p. 

(ii) For KR designs, the stationary mode would revert to lu* if ■ 



{p - Api) {I - p + Apif < {p + Ap2) l-{l-p + ApiY , (2.16) 
and vice versa. 

Proof (i) This follows directly from substituting Pu*-,(lu*+i from (j2.2p in the formula for 

lu*- 
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(ii) This follows directly from substituting the analogous KR formulae found in the 
previous subsection. □ 

The KR formula is again more complicated; analysis to first order in Api,Ap2 yields 
the approximation 

^ < (2 17) 

Ap2 {2k + l)p-l ^ ■ ^ 

for the mode to revert to For k = 2,3 this yields approximately 1.52, 1.79, respectively, 
for the Api/Ap2 ratio at the "basin boundary". The analogous factors for BCD are 2.41, 
3.85, respectively. So both designs tend to create a mode below target more often than 
above it - a tendency opposite that of GU&D(;j_o,i) ~ but KR 's tendency is quite a bit 



milder than BCD's. This explains the numerical observations of iGezmul (| 19961 ). who noted 
that KR is 'better centered' on target than same-target BCD (see e.g. Fig. 12. As above, 
on the treatment scale these same ratios can be used as approximations to first order in s. 

2.2.5 More Stationary Properties 
Peakedness of non-median-target designs 

Due to limitations related to the unknown location of target relative to the fixed design 
levels, the notion of 'peakedness' is not easily converted to more familiar terminology used 
to describe dispersion (e.g. precision or variance), but it carries a similar meaning within 
the U&D context. The more 'peaked' a stationary U&D distribution is around its mode, the 
closer the method is to the ideal design, and percentile estimati on precision will generally 



i mpro ve. Here is a rigorous definition of 'peakedness', following iGiovagnoli and Pintacuda 



(119981) 



Definition 4 (i) For two 'Up-and-Down' designs, "all other things being equal" will 
mean that both target the same percentile, the threshold distribution F is the same, the 
treatme rit levels |l„,) are the same and th e initial conditions are the same. 



(ii) \Giovaanoli and Pintacuda . \ l99a ) Let designs 1 and 2 be two U&D designs. Then, 
all other things being equal, if > 7!^^ for lu < Qp while 7^^^ < for > Qp 
(i.e., design I's stationary distribution profile is steeper), then design 1 and its stationary 
distribution are called more 'peaked'. 



Thus, for example, to compare KR and BCD we need to examine 

7f^ F^{l-Fu)'-H-p 



iS^^ 1 - (1 - F^f P 



(2.18) 
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Theorem 4 For any k > 1, all other things being equal, KR designs are more 'peaked' than 
BCD designs. 

Proof On target, the ratio in (j2.18p is exactly 1. Therefore, it suffices to show that (|2.18p 
is monotone decreasing in F. Careful differentiation yields this result. □ 

Proposition 1 For KR and GU&D(^f. Q i-^ with the same k, neither design can he said to he 
more 'peaked' than the other. 

Proof The ratio analogous to (|2.18p is 



KR 

lu 



CC^f (fe,o,i) 



Fu 


1 


-il-F^+,f 


Fu+i 





< 1. 



The inequality results from the concavity of 1 — (1 — F))^ in F. GU&D(fc^o,i)'s stationary 
distribution profile is steeper to the left of the peak, while KR's is steeper to the right of 
the peak. □ 

Fig. 12.31 provides a numerical example of 'peakedness'. The differences between methods 
do not appear to be dramatic. On the other hand, if the spacing is too coarse (e.g. m = 5 on 
the left) the stationary peak is very broad and the method's stationary properties offer only 
a modest improvement over standard, non-sequential treatment-response designs. As the 
spacing becomes finer (e.g. m = 10 on the right), a sharp peak forms: under stationarity, 
all 3 methods allocate 50 — 55% of treatments to the two levels closest to target. Thus, finer 
spacing dramatically reduces the chance for treatments allocated too far from target. In 
the scenario illustrated in Fig. 1, with m = 10, Pttj{x > 0.7) (x = 0.7 roughly corresponds 
to the 60th percentile here) is 0.02 to 0.03 depending upon method, while with m = 5 
the analogous probabilities are 0.12 to 0.14. The figure also shows (for m = 10) BCD and 
KR's tendency to have the mode below target, and GU&;D(fc o,i)'s opposite tendency. Here 
Api = 0.061 and Ap2 = 0.042, so KR's "basin boundary" falls just above target on the F 
scale, while BCD's and GU&:D(;(. o,i)'s boundaries are at a larger distance above and below 
target, respectively. 

Peakedness of median-target GU&D designs 



Another result related to GU&D 'peakedness' was numerically observed bv lGezmu and Flournov 
(|2006l ): for median-targeting GU&:D(fc ;,_„) (fe > 2), the stationary distribution appeared 



more peaked with decreasing a. Here is a general proof. 
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Figure 2.3: Stationary distributions for KR ('k' marks and solid lines), BCD ('b' marks 
and dotted lines) and GU&D ('g' marks and dashed lines) - all targeting (5o.293, marked 
with a vertical line. F is Weibull with shape parameter 1.5, and scale normalized so that 
F{1) = 0.8. Shown are a coarser design with m = 5 (left) and a finer design with m = 10 
(right). Also plotted is a standard non-sequential treatment allocation of l/{m — 1) to 
interior levels and l/2(m — 1) to boundary levels ('s' marks and thick lines). The vertical 
axes are normalized to account for the factor 2 difference in m, so that the area under the 
curves is similar in both plots. 
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Theorem 5 For GU&D(^i^ a,k-a) designs (k > 2), and all other things being equal, 'peaked- 
ness' increases with decreasing a. 

Proof Consider design 1 with a and design 2 with a , such that a < a . Since the transition 
probabihties are 'drawn' from the same binomial model, they can be represented thus: 
Pu= Pu + 5pu,q'u = qu + SQu, where dpu = Pr{a < Ig < o' | Fu),6qu = Pr{k - a <Yg < 
k — a I Fu). The 'peakedness' ratio is then 



Pu Qu+l _ (<?«+! + dqu+l)Pu _ Pr{YG>k-a\F^+^) 



Qu+l P'u {Pu + Spu)qu+l 1 + Pria<YG<a'fu) 



(2.19) 



If Fu + Fu+i = 1 (meaning that the target is exactly midway between the two levels if 
distance is measured on the response scale), the ratio is exactly 1. Focusing on the fraction 
on the right-hand-side, the ratio in the numerator is decreasing with F while the one in the 
denominator is increasing. Therefore, the ratio would be < 1 with both levels below target 
and vice versa, and thus by definition design 1 is more 'peaked'. □ 

2.2.6 Convergence Rates 

For U&D methods, the term "convergence rate" can be interpreted in a number of ways. In 
view of theoretical results about the stationary mode, one may be interested in the conver- 
gence of the probability that the empirical mode is the stationary mode, or more generally 
in the convergence of empirical frequencies to their stationary values. On the other hand, 
in view of the prevalence of averaging-based estimators, there is practical interest in the 
convergence of the empirical mean to the stationary mean. Fortunately, the three methods 
discussed here generate simple Markov chains with finite state spaces, and therefore all con- 
vergence rate comparisons should yield equivalent results. Here we focus on the empirical 
mean, which is the simplest, most intuitive, and carries a direct practical significance for 
most applications. 



As lGezmu and Flournovl (|2006l ) note, for finite-state Markov chains the exact treatment 
distribution after i trials can be calculated, given knowledge of initial conditions and of P, 
the design's transition probability matrix (tpm): 



P 



Wt =p(i)tpi-i^ (2.20) 



where p^^^ is an initial probability vector over the levels, f is the transpose operator and 
P^~^ is P raised to the i — 1-th power. As n ^ oo, p^"^ — > vr, regardless of p^^\ Convergence 
rates can be estimated using the tpm. The tpm's of the three Markovian methods discussed 
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here are stochastic and irreducible. Hence, the real parts of their eigenvalues are bounded 
between —1 and 1, and are usually labeled in decreasing order: 1 = Aq > -Re(Ai) > . . . > 
Re{Xm-i) > —1- If the tpm is also reversible, that is niPij = njPji then the eigenvalues 
are all real. In that case, it can be shown that the tota l varia tion distance between p*-") and 



TT converges to zero with a rate (|Diaconis and Stroock 



1991 



ipnjl) 



vr, 



< cA 



2n 
# ' 



(2.21) 



where c is a constant and A# = max (Ai, | \m-i I)- Thus, the rate of convergence is governed 
by the second-largest eigenvalue. For finite-state Markov chains, it is straightforward to 
show that the empirical mean would converge at half the rate of the total variation distance. 

Unfortunately, comparing the 3 methods cannot be trivially reduced to an eigenvalue 
problem: KR tpn i' s (list ing all internal sub-states as separate states) are mkxmk irreversible 
matrices (jGezmul . 1 19961 ). On the other hand, all sub-states of the same external KR level 
actually represent the same experimental treatment. Therefore, a possible fix is to compare 
convergence by observing KR's k x k reversible matrix of marginal transition probabilities, 
using (j2.10p . This is analogous to assuming that in KR experiments internal-state balance 
is achieved much faster than external-level balance. 

One relationship can be directly proven without resorting to eigenvalues. 

Proposition 2 Consider a KR and a GU&Di^j^ Q i-^ design with the same k. Then, all other 
things being equal, the KR design converges faster to its stationary distribution. 

Proof Recall Corollary [1] and the discussion following it, and observe the zero-state-only 
KR subchain. Its transition probabilities are identical to those of GU&:D(;j o i). However, 
while the latter must carry out a cohort of k trials for each transition, the KR subchain 
may take anywhere from 1 to A: trials - on the average, less than k. Therefore, it would 
require less trials to achieve the same probabilistic convergence. □ 

In order to investigate convergence rates more closely, an "all other things being equal" 
numerical time-progression was performed, as in (|2.20p rl A summary of results for conver- 
gence 'upwards' from li and some targets appears in Table 12.11 ('downward' convergence 



*Two sets of initial conditions were examined: beginning at h (similar to typical Phase I conditions) 
or at Im, over a variety of right-skewed, left-skewed and symmetric threshold distributions, 3 values of a 
and fc = 1,2,3 for BCD, KR and GU&D (fe,o,i)i second comparison was performed between BCD and 
GU&D {3,0,2)- Table [2Tll shows a subset of the data. The adjectives 'tight' and 'disperse' for the logistic 
and log-normal scenarios indicate smaller or larger scale parameters, respectively. Convergence rates were 
estimated via an exponential fit on the difference between the mean at time i and the stationary mean. All 
calculations - here and in the rest of the dissertation - were performed in R (|R Development Core Teaml . 
[2005. ). 
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Table 2.1: Comparative summary of convergence calculations for several design scenarios 
and targets. Shown are the number of trials needed to achieve 99% convergence of the 
ensemble mean of Xi to the stationary mean, beginning with the entire probability mass on 
li at trial 1. Numbers were rounded up to the nearest integer, except for GU&D, for which 
they were rounded up to the nearest multiple of k. 



Trials to 99% Convergence 





Distribution 


WeibuU 

1 (Exp.) 


(Shape 

1.5 


Parameter) 

3.5 5 


'Tight' 


fistic 

'Disperse' 


Log 
'Tight' 


;-Nornial 
'Disperse' 




SU&D 


12 


9 


6 


10 


7 


9 


6 


9 




BCD, r = 0.293 


19 


17 


15 


17 


14 


17 


13 


16 


m = 5 


KR, fc = 2 


17 


14 


10 


10 


10 


14 


8 


13 




GU&D (2,04) 


20 


16 


12 


10 


12 


16 


12 


16 




BCD, r = 0.347 


18 


15 


12 


12 


12 


15 


10 


14 




GU&D (3,0,2) 


36 


27 


15 


15 


15 


24 


15 


24 




SU&D 


34 


26 


14 


12 


13 


22 


14 


26 




BCD, r = 0.293 


44 


38 


30 


31 


28 


39 


23 


33 


m = 10 


KR, fc = 2 


39 


32 


21 


21 


20 


32 


17 


27 




GU&D (2,0,1) 


46 


36 


24 


22 


22 


36 


20 


32 




BCD, r = 0.347 


44 


35 


24 


23 


22 


34 


19 


31 




GU&D (3,0,2) 


81 


57 


33 


30 


33 


54 


27 


51 



was faster in most scenarios, but the overall pattern was not substantially different; data 
omitted here). Shown are the number of trials needed for the ensemble mean of the Xj's 
to converge 99% of the way from li to the stationary mean. These numbers can be seen 
as a practical cutoff, beyond which the treatment chain is 'as good as' a sample from vr, at 
least for averaging purposes (one reason for choosing 99% is because halving the numbers 
on Table [2T] conveniently yields the number of trials needed to converge a more lenient 90% 
of the way) . 

Some observations from Table 12.11 

1. SU&D converges faster than any non- median method examined. This is intuitively 
expected, since SU&D requires a only single trial before transition in either direction. 

2. For a given target response rate p, convergence rates are most strongly affected by a 
combination of distribution properties and spacing - specifically, by the difference in 
F values between adjacent levels around target. For example, the "shallowest" CDF 
(exponential thresholds and m = 10) causes the slowest convergence, and vice versa. 
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3. Among the three non-median methods, KR is sohdly ahead, while BCD is slowest. 
Based on these and other numerical runs not shown here, one could expect BCD to 
take around 20 — 40% more trials than same-target KR to reach the same degree of 
convergence. The performance gap increases with k. 

4. The number of trials to 99% convergence can be quite substantial when compared with 
the design limitations of applications such as Phase I clinical trials. Moreover, the 
Markovian method most closely resembling the '3+3' design (GU&D(3 0,2)) converges 
very slowly - even slower than BCD. It takes GU&:D(3 0,2) 50 — 90% more trials than 
KR (k = 2) to achieve the same degree of convergence, even though the former's target 
is closer to the median. Furthermore, recall that these are ensemble averages and not 
single-run statistics. 

Table [2?TT s convergence rate estimates are in rough agreement with eigenvalue predictions for 
BCD and GU&D (data not shown). KR rates show an agreement with the 'marginalized' 
tpm eigenvalue, indicating that indeed it is the between-level convergence that is rate- 
limiting. 

Note that in agreement with Proposition [2l KR converges faster than GU&:D(;j o,i)- Fur- 
thermore, same-targeted BCD is the slowest-converging. This has been observed universally, 
over a much wider range of scenarios than is reported here. I have yet to find a rigorous 
proof why. Can a heuristic explanation be found? 



The KR-BCD convergence gap recalls the observation by iGezmu and Flournovl (|2006l ) 
that for median-targeting GU&D with fixed k, convergence speeds up with increasing a. In 
both cases, the (marginal) pu, Qu of the faster method are always as large or larger than 
those of the slower method - and in such a way that the |p„ — qu\ difference is also larger. 
This means that both the "probability flux" in each direction, and the net flux in direction 
of target, are larger for the faster method. This is formalized in the conjecture below. 

Conjecture 1 Consider two U&D designs: design 1 withp,q,r and design 2 with p' , q' , r' , 
and all other things being equal. If Pu > p'ujQu ^ Qu f^''" levels, and \ p^ — qu+i |^| 
Pu — Qu+i I when lu,lu+i ire both below or above target, then design 1 converges faster. 

Note the similarity to and difference from the 'peakedness' definition. Definition HI Here 
and there we compare up and down probabilities between adjacent levels. However, 'peaked- 
ness' is related to the probability ratio, while the convergence conjecture is related to the 
probability difference (or 'flux'). In the median-targeting GU&D case, increasing a in- 
creases the difference while making the ratio smaller; therefore for this design family one 
faces a tradeoff between fast convergence and sharp stationary distribution. On the other 
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hand, KR maintains both a greater ratio and a larger difference between up and down 
probabihties, compared with BCD - so it has a performance edge on both aspects. 

2. 2. 7 Bias of Early Reversals 
First Reversal 

From this chapter's theoretical analysis, it is now clear that the number of trials to the first 
reversal is a random variable. Of practical interest is the location of first reversal, denoted 
x/jj, conditional upon starting point xi. Those who use the first reversal as a cutoff or 
transition point, implicitly assume that this location is close to target. Let us examine this 
for GU&D(;j_o,i) ^'^'^ KR. For both of them, assuming a start from the lowest level Zi, the 
distribution will be 



This distribution is reminiscent of a generalized geometric r.v., with the reversal defined 
as 'success', and with monotonically decreasing 'failure' (i.e., passage) probabilities at each 
trial. If s is sufficiently small w.r.t. the slope of F, then in the vicinity of Qp the 'up' 
passage probability is approximately 0.5. This means that conditional upon reaching 
the level immediately below Qp, before the first reversal, the expected additional number 
of levels traversed is less than 1 - landing us very close to target. However, the expected 
location of xr^ is a weighted average of this and of the expectation conditional upon not 
reaching /„. : 

E [xR^ =l^ \ xi = li]= E [xR^ =l^\xi = h, XR^ > lu-] Pr {xr^ > lu* \ xi = h) 



+E [xR^ =l^\xi = li,XR^ < lu*] Pr {xR^ < lu* I xi = h) . 

Therefore, under typical design scenarios, if Qp > Is, xr^ should be expected to be well 
below target. Now, suppose xi = lu, with /i < < /„. . Then (j2.23p still holds, but now 
the second term includes the possibility that the chain's first move would be downward. So 
again, if xi is 2 or more levels below target, E [xr-^^] should also be below target. The same 
conclusions hold for BCD since eventual transition probability without a reversal is also 0.5 
on target, and is monotone decreasing with increasing x. On the other hand, if xi is close 
to target, the first reversal will be approximately 'right on'. 

If the experiment begins at the top level Im, for SU&D the outcome will be similar 
to that when starting from bottom. For the non-median methods, downward transition 
probabilities will become 0.5 at the median - long before reaching target. Therefore, if xi is 



Pr (XR, =lu\xi= h) = [l - (1 - Fu)''] Hil- F,f 



(2.22) 
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at or above the median, the upward bias of xr^ should be even greater than the downward 
bias when starting below target. 

This asymmetry suggests that perhaps the interpretation of reversal for non-median- 
target designs may be wrong. Perhaps, instead of "a point where a change in response is 
observed" we should define reversal as a point where a change in chain direction is observed, 
i.e. a move up when the last move was down (ignoring trials that mandate no change). For 
SU&D the two definitions are equivalent. It is easy to see that for non-median-target 
designs the second definition causes a symmetry in behavior, as far as first reversal bias 
is concerned. However, a closer look reveals that direction changes are more likely than 
reversals to occur far from target (to either side; more about this on Section 3.2). All in 
all, the potential benefit from this changed definition is not clear, and is outweighed by the 
added complication. Hence it is more useful to understand the properties of reversals as 
they are currently defined, and plan our design accordingly. 

To sum it up, the location of xr^ is in general biased towards xi, and is not guaranteed 
to be close to target unless xi itself is close. 

Subsequent Reversals 

So we have just learned that unless xi is very close to target, the point of first reversal 
is likely to be on the same side of target as xi (though closer on the average). Since xr^ 
is between xi and xr-^ , the second reversal is an even worse choice for cutoff or transition. 
However, the expected net movement away from target between first and second reversals, 
is obviously smaller than the expected gain between starting point and first reversal (since 
probabilities for moving away from target are smaller than for moving towards it). This 
makes the third reversal an interesting cutoff-point candidate. The next potential can- 
didate is naturally the fifth reversal. Choosing a cutoff point beyond the fifth reversal may 
be too late for many sample-size restricted applications. 



34 



Glossary for Chapters 1-2 

Glossary is alphabetically ordered, starting with acronyms, then mathematical symbols 
(Roman letters first, Greek following). There are additional glossaries for Ch. 3 and Ch.4- 
5, but this one is the most extensive. 

Note: in order to keep the glossary manageable and informative, I chose to omit some 
symbols which make a one-time appearance somewhere in the text, if they have no general 
significance. 



BCD: The Biased-Coin Up-and-Down design developed by lDurham et al.l (|l995l ) 



GU&D: The Group Up-and-Down design first developed by iTsutakawal (Il967bl ). 



KR: The "k-in-a-row" Up-and-Down design, first deve loped by 



Wetherill et al 



Markov-chain properties fir s t desc ribed properly by 



Gezmu 



is taken from 



Ivanova et al 



(11963); 



(|l996l ). The name KR 



(|2003l ): I use it because it better distinguishes KR from 



other U&D variants. 



S UfcD: The original median-targeting Up-and-Down design described by lDixon and Mood 
llMsl)- 



U&D: Up-and-Down. In this thesis, I use the name U&D to refer to the family of 
designs that generate Markov chains whose stationary distribution is centered on the 
target percentile. Includes (though not limited to) BCD, GU&D, KR and SU&D. 



a, b: The lower and upper transition criteria parameters in group U&D designs. If < a 
'yes' responses are observed in the /c-sized cohort, a move 'up' is mandated, and if > 6 
are observed a move 'down' is mandated. 

F: The CDF of thresholds; throughout the thesis, assumed to be strictly increasing 
and to have a density. 

/: The threshold density; first derivative of F. 

F^,: The value of F at treatment level L. 
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H: A transformed CDF of thresholds for a certain family of GU&D designs (see text 
for details). The con cept of "transformed CDF" was first offered in this context by 



Wether ill et al, 



(|l966l ). where it was used erroneously for KR-type designs. 



i: Used for indexing trials and responses. 



k: The number of consecutive 'yes' ('no') responses needed for a move 'up' ('down') 
in below-(above-)median KR designs; the cohort size in group U&D designs. 

lu- Treatment level, indexed u. 



m: The number of treatment levels (if finite) 



A^: Ensemble size (i.e., number of individual runs) used in simulations. 



n: Sample size of a single experiment. 
p: The target response rate. 



P: The transition probability matrix (tpm) of an U&D design. 



Pu: Qu: fu (indexed): The U&D 'up', 'down' and 'stay the same' probabilities, respec- 
tively, at level lu- 

Qp. The p-ih. quantile (or lOOp-th percentile) of F. Qp, with p specifically the target 
response rate, is the experiment's target. 

Rj'. Index denoting the location (between 1 and n) of the j-ih reversal point. 

s: The spacing between adjacent treatment levels (assuming levels are evenly spaced). 

T: The response-threshold r.v. Viewing a binary-response experiment as thresholds 
being triggered by the treatments, underlies the entire statistical approach to the 
percentile-finding problem. Individual thresholds are then assumed to come from 
some (generally unknown) distribution with a density /. 



t: Actual values that T takes. 



u,v,u ,v : Used for indexing treatment levels. 
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Wf. The internal state at trial i in a KR experiment, so that the current state of the 
experiment is fully described by the pair (xi,Wi). w can take integer values from to 
k-1. 

X (generic) : The treatment as an independent variable (in the algebraic sense) . 
Xi (indexed): The treatment value at trial i of an U&D design. 

Hi (indexed): The binary response at trial i. 

Yq'- The cohort response variable in group U&D experiments. It is binomial rather 
than binary. 

V: The BCD "biased-coin" parameter. It is equal to p, but is retained here for ter- 
minological compatibility with other publications. 

7^: The "stationary profile" : the ratio between adjacent stationary frequencies, tTu+i/tt, 

7(x): The "stationary profile" as an algebraic function of the independent variable x. 

\u- The n-th eigenvalue of the tpm P, arranged in decreasing order. 

tt: The stationary and/or limiting distribution of treatment allocations. 

p*^*): The probability distribution of treatments (across levels) at trial i. 

t: Used for indexing internal states in KR designs. 
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Chapter 3 

NONPARAMETRIC UP-AND-DOWN ESTIMATION 
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3.1 Existing Estimators and Current Knowledge 

Nearly all U&D target estimators can be divided into two groups: averaging estimators, 
making direct use of the treatment chain {x„} only; and response-based estimators, 
making use of both treatments and responses. This chapter will discuss each type separately, 
in alternating order. Only the numerical study (Section 13. 4p will compare them to each 
other. As the chapter's title indicates, the focus is solely on nonparametric estimators. 



3.1.1 Averaging Estimators 



History and Basic Description 

Early U&D estimators were modified averages of the tr e atment chain, with th e mod i fication 



1948 



Brownlee et al. 



1953 



Dixon 



coeffi ci ents based o n norin al theory (jPixon and Mood 
19651 ). iTsutakawal (jl967bl ) studied the properties of an average of all trials beginning with 
X2, and found a nonparametric expression for its variance (see further below). However, the 
U&D estimator of choice in most field s is reversal averaging, introduced by Wetherill and 
aptly named w (|wetherill et al.l . ll966l ^FI The estimator w is a somewhat modified arithmetic 



average of x at reversal points {xr^} (reversals were defined in Section [27T]) : 

EjXR^+XR^-l 



w 



2nR 



(3.1) 



where hr is t he number of reversals. A straightforward reversal average is known as w, 

J). Estimators belonging to the iv or w type (i.e., some average of 



Choi (1971 



introduced by 

treatments at reversals only, beginning from a certain cutoff point) appear to be the most 
commonly used U&D estimators in median-finding applications across most relevant fields, 
and in psychoph ysics e xperim ents using KR. From here on I will refer to this estimator 



family as t&o As Choi ( 1971 ) pointed out, for SU&D w is identical to the simpler w when 
ur is even, and differs from it by ibs/2n/j when ur is odd. For KR and other non-median 
methods, this is not the case, but the two are still very closely related. 

Considering its prevalent use, w has rece i ved very sc ant theoretical attention in its 40 

(jl985l . 119871 ) looked at stationary properties. 



Kershaw 



years of existence. In the 1980's, 
especially variance, and concluded that w is in fact inferior to Dixon and Mood's original 



^Interestingly, Wetherill's reasoning for favoring reversals over all treatments was less theoretical than 
numerical: this estimator performed somewhat better in simulation. 

^It should be noted that according to lPace and Stvlianoul (l2007h 's rec ent meta-analysis of 16 U fcD anes- 
thesiology experiments published in 2000-2006, half the studies stiU used lDixon and MoodI (|l948l ')'s original 
estimator, with w only second in popularity. 
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estimator. Kershaw's conclusions were la rgely ignored, and w c ontinues to be the most 
popular U&D estimator. More recently, iPaul and Fished (|200ll ) conducted a numerical 
study of w in the context of anesthesiology experiments. They concluded that 6 or more 
reversals are needed for reliable estimation, contrary to a prevalent practice by some groups 
in that fie l d tha t use only 4. 



DixonI (|l965l ) claimed without proof, that SUfcD's empirical mean (ap parently under 



stationarity) is unbiased if F has a symmetric density. iGarcia-Perea (119981 ) simulated sta- 
tionary SU&D and KR runs (using the last 9000 reversals from runs having 10000 reversals 
each), finding that for KR w is biased towards the median (a bias that is of course zero for 
SU&D). 

Finally, th e empirical mode can be seen as related to averaging estimators. Markov- 
chain studies (IDurham et al.l . Il995l : iGiovagnoli and Pintacudal . Il998l ) prove that the empir- 
ical mode converges to the stationary mode, which is guaranteed to be one of the 2 levels 
closest to target. However, we saw in Section 12.21 that the stationary mode may end up 
being only the second-closest level to target, especially for BCD and KR designs. Addition- 
ally, the mode is restricted to be at a treatment level, while averages can vary on a finer 
scale and hence are more precis e. It is therefore not surprising that numerica l stud i es find 



this e stimator performs poorly ([Durham et al.l . 119971 : IStvlianou and Flournovl . 
20051 ) . The empirical mode will not be discussed further. 



2002 



Qronl . 



Mitigating The Starting-Point Effect 

Perhaps the most obvious problem with averaging estimators is starting-point bias. The 
first treatment xi is completely arbitrary (even if determined with good judgment). Then, 
somewhere during the experiment, the chain begins meandering around target. But at 
what point does the chain become "good enough" to include in an averaging estimator? 
Since during U&D's formative years it was not understood to generate a Marko v chain. 



solution approaches had developed gradually and intuitively. iBrownlee et al.l (|l953l ) recom- 
mended discarding xi (they were also the first to note that Xn+i is uniquely d etermined 
by the experiin e nt, an d incorporated it into the estimate instead). iDixonI (|l965l ) and then 



Wether ill et al.l (119661 ) recommended starting from the first reversal. Much more recently, 
Garcia- Perezj ( 19981 ). based on extensive KR simulations, recommended starting from the 



2nd or 3rd reversal. In practice, most researchers use all reversals for w, but some discard 
the first 1 - 4§ 



^Garcia-Perej l| 19981 ) tabulates a meta-analysis of KR vision research experiments published in the mid- 
1990's. The vast majority of the 82 studies used some form of reversal averaging. Out of 60 reversal- 
averaging studies for which information was available about initial-sequence removal, 32 removed no 
reversals, 14 removed the first 1-2 reversals, 12 removed 3 or more reversals, and 2 used some other 
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In the previous chapter, we have seen that in general the first reversal point is quite 
likely to occur while the starting-point bias is still substantial. We will revisit this issue, 
and novel ways to tackle it, later on. 



Mitigating The Boundary Effect 

When the treatment set is finite and Qp lies too close to a design boundary, averaging esti- 
mators will display a strong bias away from that bou ndary (this point will be explained in 
more detail in the next section). iGarcia-Perezj (|l998l ) offers an interesting fix for this prob- 
lem, dubbed "layover boundary conditions": one continues the experimental bookkeeping 
(for averaging purposes) as if there was no boundary. However, in case levels beyond the 
boundary are indicated, the boundary treatment is administered. For example, if = Ij 
s = 0.1 and the transition rules indicate an upward move, the next treatment will be 
recorded as 1.1 but administered at 1. Afterwards, the chain will have to move back down 
"through" 1 before reaching 0.9. 

Layover boundary conditions are equivalent to assuming that F remains constant outside 
the design boundaries. The treatment chain becomes infinite-state, but since transition 
probabilities beyond the true boundaries are equal to those at the boundaries, the stationary 
frequencies of the 'phantom' levels constitute a diminishing geometric sequence, and so vr 
still exists. This issue, too, will be inspected more closely later on. 



3.1.2 Response Based Estimation: Isotonic Regression 

Response-based averages use {Fm}, the empirical binomial point estimates of F at design 
points, created by tabulating the responses: 



Fu 



In] 



(3.2) 



Probit and logit regres sion are occasionally enocountered in U&D applications (|Storer 



Camorcia et al 



1993 



20041 ) . Being parametric, they exhibit the usual pros and cons, with their 



quality depending mostly on whether the model family provides a fair approximation of F. 
Parametric U&D estimators will not be discussed here. 

A nonparametric response-based estimator - isotonic regression, which is in fact the 



NPMLE - was recent 



y introduced to UfcD in the Phase I context, and has been recom- 



mended by the NIH (jStvlianou and Flournovl . l2002l : IStvlianou et al.l . l2003l : 



NIEHS 



2001 



This innovation has yet to reach all other fields using U&D; and perhaps for the better, since 



criterion. 
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it was not directly compared with w which it aims to replace. This comparison appears 
further below in the numerical study. 

Isotonic regres sion (IR) is a well-known nonparametric so lution for a variety of sta- 



tistical problems ([Barlow et al 



1972; 



Robertson et al 



19881 ) ■ IR can be produced via 



a couple of lengthy-named algorithms (pool-adjacent-violators-algorithm, or PAVA; least- 
convex-minorant), but in fixed treatment-response design context it boils down to a set of 
binomial point estimates: the original point estimate Fu wherever there is no monotonicity 
violation, and an estimate pooled from the responses at adjacent levels in case of a violation. 
The procedure is described in Algorithm 1 below. 



Algorithm 1 Pool- Adjacent- Violators Algorithm (PAVA), Adapted for Treatment- 
Response Designs 

procedure PAVA({2;m}, {n^}) 

while H = {u : 1 < u < m, Zu > Zu+i} ^ do 
i <— min(H) 

M ^ 1 

while Zi > Zi^]\j do 

Zi = ... = Zi+M Zi:i+M = Yl'uJf nuZu, where n„ = n«/ (Yl'jtf '^i) 

M ^ M + 1 
end while 
end while 

Return {zm}- 
end procedure 



PAVA replaces any subsequence of {Fm} it deems "violating" by a sequence that is a 
repetition of a single value (which in the case of treatment-responses designs is simply the 
overall proportion of 'yes' responses over the subsequence). This means that the IR estimate 
of F is constant over any interval of x values covered by a violating subsequence. In intervals 
between non-violating design points, IR provides no unique estimate. Since in order to find 
the percentile one has to perform an inverse estimation, we need a (forward) estimate of F 
over the entire range. In tr eatment-respq r ise applications this problem is comm only solved 
by linear interpolation (e.g. 



Dilleen et al. 



2003 



Stvlianou and Flournovl . 



200: 



The operation of IR on treatment-response data is illustrated in Table 13.11 and Fig. 13.11 
Shown are two simulated runs, in each of which there was a single violating pair. Assuming 



^apparently, in some fields people still believe that a straight line is the simplest way to connect two 
points. So statisticians have their work cut out for them :) 
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Table 3.1: Summary tables illustrating how IR works in practice on treatment-response 
data. Data are taken from two simulated GU&D(2,o,i) experiments with the same threshold 
distribution and n = 32. Yes/no summaries from sequences of violating points are pooled 
together to a single point estimate, which is then used for all original points in the sequence. 



Treatment Raw Input IR Output Raw Input IR Output 

Yes No F Yes No F Yes No F Yes No F 



0.17 





4 


0.00 





4 


0.00 


1 


7 


0.13 


1 


7 


0.13 


0.33 


3 


9 


0.25 


3 


9 


0.25 


4 


8 


0.33 


6 


14 


0.30 


0.50 


3 


7 


0.30 




10 


0.28 


2 


6 


0.25 


0.30 


0.67 


1 


3 


0.25 


4 


0.28 


4 





1.00 


4 





1.00 


0.83 


1 


1 


0.50 


1 


1 


0.50 















(Simulation Run 14) (Simulation Run 9) 



the target is Qo.3) in Run 14 (LHS of both table and figure) = p exactly; however, > 
F4. Therefore, technically we cannot determine a unique estimate via linear interpolation 
between F values; and conceptually, neither F^ nor F4 can be fully "trusted" without 
more information. IR resolves this by pooling the data from points 3, 4 together; the final 
estimate for Q0.3 is to the right of point 4. In Run 9 (RHS) we see an (extreme, but possible 
whenever p is rational) example for IR's limitations: the flat stretch produced by the process 
lies exactly at F = 0.3, and therefore IR does not provide a unique inverse estimate without 
some additional convention (taking the highest/lowest/midpoint of possible estimates, etc.). 
These two numerical examples will be revisited in Section 13. 3[ 

Why is IR the NPMLE for our application? Let us examine U&D's nonpar ametric 
likelihood; it can be written as 

£ = Pr ({j;„,y„}) = 117=2 • ■■Xi-i,yi . . . yi^i)Fr {yi\xi ...Xi,yi. ..yi-i) 

= const. X nr=2P^(y*l^i) (3-3) 

= const. X n;r=i ^^""^-(1 - f„)""(i-^"), 

where the F^ were defined above in (|3.2p . The derivation uses the following properties: 

• Given each treatment Xj, the corresponding response yi depends on no other treatment 
or response, since the underlying threshold sampling is i.i.d.; 

• Given all previous treatments and responses, Xj can be found either deterministically, 
or (for BCD) using a separate, independent random draw which is unrelated to F. 
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Isotonic Regression: Run 14 



Isotonic Regression: Run 9 



CO 

I 



00 

o 



CO 

o 



o 



o 
o 




1 \ \ \ \ \ r 

0.2 0.3 0.4 0.5 0.6 0.7 0.8 
Treatment 



CO 

I 



00 

o 



O 



O 



O 

o 




1 \ \ \ \ \ r 

0.2 0.3 0.4 0.5 0.6 0.7 0.8 
Treatment 



Figure 3.1: Graphs illustrating how IR works in practice on treatment-response data. Data 
are the same as shown on Table 13.11 Raw F values are in 'X' marks, and the IR output 
values are in circles connected by solid lines. The dashed red lines indicate the inverse- 
interpolation process. Note that in Run 9 (right), the IR estimate for Qo.3 is not unique, 
and could be any value in [1/3, 1/2]. 



Hence, the "const" appearing in the equation is either a product of functions of the 
biased-coin parameter T (for BCD), or simply 1 (for all other designs). 

The form of (j3.3p is identical to a product of binomial likelihoods, except for having 
different constant coefficients (the binomial likelihood has combinatorial coefficients). A 
product of binomials is the nonparametric likelihood of a standard non-sequential treatment- 
response experiment. Note that the nonparametric approach yields a likelihood that can be 
interpreted parametrically, as having m parameters representing the true values {Fm}. 

Now, since the log-likelihood takes the form 

m 

log £ = const. + ^ n„ {f„ log F„ + (1 - F^) log [1 - , (3.4) 

u=l 

the unconstrained NPMLE is of course just {Fm}. Under monotonicity constraints, pro- 
ceeding recursively from the first monotonicity-violating pair of F values and treating 
one pair at a time, it can be easily shown that the constrained NPMLE is identical to 



PAVA's output. More detailed derivations appear in standard references ([Barlow et al 
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1972 : lEobertson et alJ . Il988l ) . 

The "flat" intervals produced by PAVA in case of a monotonicity violation in {F^} ap- 
pear to hurt estimation performance, both forward and inverse. 1 will inspect this suspicion 
more closely in the next section. 



3.1.3 Interval Estimation 

Nonparametric U&D interval estimation has been even more seldom studied than point 
estimation. However, interest in this topic as well has been picking up recently. 



SD-Based Approaches 



Dixon and Mood 



on normal theory. 



(ll948lVs original averaging estimator came with a variance estimator based 



Pollak et al 



(|2006l ). using numerical simulations of material- fatigue 
SU&D testing, show that this variance estimator performs very poorly, and is strongly 
dependent upon the spacing s. They suggest a 3-parameter empirical correction function 
fitted to their data. 

In anesthesiology, some researchers using SU&D with w report confidence intervals. In 
calculating standard errors, they seem to count each pair of reversals as a single independent 
observation, and assume asymp totic normality of th e average, hence the in ethod's name: 



"in dependent paired r eversals" (jCapogna et al 



2001 



Camorcia et al. 



20O 



Tsutakawal (jl967bl ) pursued a Markov-chain based approach. He noted that a stationary 



Markov chain can be split into subchains, identified by successive hitting times at a given 
state - i.e., the points at which the treatment chain visits the state. The lengths of these 
subchains, {i^j}, are i.i.d. random v ariables, as i s any a. lgebraic functional on the treatments, 
cal culated separa tely per subchain. 



Tsutakawa 



on 



(jl967bl ) then goes on, basing his calculations 



Chuna (|l96d . Ch. 1), and claims that 



'^3 3 



Var{x) 



(3.5) 



as n — > oo. Therefore, he proposes the LHS of (jS.Sp as an estimator of the variance for 
the empirical-mean estimator. However, from data tables in that article, as well as my own 
numerical trials, it seems that the LHS approaches the limit from below, and therefore the 
resulting Cl's are too optimistic. 



^To be honest, up till now (summer 2007) I have yet to find a full documentation, nor a reference 
for "independent paired reversals". The observation reported here is based on "reverse engineering" of 
reported estimates in anesthesiology articles using the raw data, and I am not 100% sure that it is correct. 
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Another approach to averaging-estimator confidence intervals was presented by I Choi 



(|l99d ). Starting from first principles, the variance of the average of n r.v.'s {Vi, . . . Vn} is 

E^Var{V,) + 2Z^<^CoviVi,VJ) 



Var [Vn 



(3.6) 



Now, being part of a Markov chain, x values at reversals are clearly dependent. Assuming 
that they are ide ntically distributed and that dependence follows an AR(1) autoregressive 
form, IChoil (jl99Cll ) obtains formulae for the s.e. of t() as a function of the threshold s.d. a and 



the autocorrelation coef ficient p. Additionally assuming normally-distributed thresholds, 

CI's can be calculated. IChoil ([l990|) found them to be overly optimistic, but better than 



Dixon and Mood's original CI's. 



Bootstrap Approaches 

Recently, use of the b ootstrap for U&D CI's was exp^ 



(IChao and Fuh 



2001 



ored - both for param etric estimation 



y) and for for isotonic regression ([Stvlianou et al.l . l2003l ) . The bootstrap 
approach for U&D is to simulate chains of length n, with transition probabilities determined 
via some estimate of F (IR or a parametric regression). In both cases, however, the boot- 
strap was used not for CI's around Qp, but for CI's around the forward estimate of F at 
the cl osest design leve l. IChao and Fuhl (120011 ) used the ordinary boot strap and bootstrap-t . 



while 



19951 . Ch. 5) 



ical fix to 



Pollak et al 



Stvlianou et al.l (12003) use d the bias-corrected bootstrap (see iDavison and Hinklev 



20061) explored using the bootstrap combined with their empir- 



Dixon and MoodI (|l948l )'s formula, and reported good results. 



Use of Current- Status Theory 

Recalling that U&D is a special case of current-stat us data, there is a wealth of th eoretically- 
derived interval estimation knowledge in that field ([Baneriee and Wellneii . l2005l ). However, 
U&D appears to be too special a case: under the typical current-status assumptions, treat- 
ments are random with a continuous CDF. This leads to an n^/^ rate of convergence every- 
where for the NPMLE. U&D, though its treatments are random, is more akin to standard 
treatment-response designs in that treatments are limited to a discrete finite grid - while 
the continuity of treatments is a key assumption in most current-status CI discussions. As 
a result, the point estimates {Fm} converge with an n^/^ rate, and the linearly-interpolated 
IR used in U&D converges at the same rate - but not to F, but rather to -F - a linear 
interpolation of F between the design points. Therefore, it appears that current-status CI 
theory is not applicable to U&D interval estimation. 
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3.2 Theoretical Study 

3.2.1 Conceptual Prelude 

We have at our disposal two nonparametric approaches to U&D estimation. The response- 
based one is generic to any binary-outcome design. It does not make direct use of the 
properties of vr, but benefits indirectly from having more precise F estimates in the vicinity 
of Qp. Averaging estimators, on the other hand, are intimately related to the peculiar 
properties of U&D sampling. 

Considering the stationary properties already discussed here, one can say that some 
sort of empirical treatment mean is the "natural estimator" for U&D designs. I 
put forward this rather strong statement, in order to re-emphasize the conversion trick that 
underpins averaging estimation. U&D converts censored sampling back into direct sampling 
- which, as suggested in Section 11.2.11 would be far more desirable than indirect, binary 
censored sampling. 

However, to utilize the potential of averaging estimators, we need to better understand 
the properties of vr and of random- walk sampling from it. Unfortunately, since the averaging 
estimators currently in use were developed without the Markovian perspective, they do not 
utilize the full estimation potential. In particular, choosing to average reversal points means 
that we are not sampling from vr itself, but from its "sibling": the stationary distribution 
of reversals. Much of this section will be devoted to untangling the peculiar properties of 
such sampling, and to suggesting other approaches. 

Meanwhile, even though isotonic regression is not "natural" in the sense discussed above, 
this also means that it suffers less from U&D's peculiarities and limitations, such as starting- 
point bias or boundary effects. Therefore we expect it to be more robust, and will look at 
its properties more carefully later on. 

3.2.2 Estimation Properties of Treatment Means 
Bias of the Stationary Mean 

We have already seen that for median-targeting designs and for GU&D(fc o,i)i the stationary 
mode is on the closest treatment to target on the response scale - which is also the closest on 
the threshold or treatment scale (to first order in s). BCD and KR tend to shift the mode 
somewhat to the left of target. One could expect, therefore, that with the former variants 
the stationary mean jji^^ would show little or no bias w.r.t. Qp, while with the latter there 
would be a downward bias, or more generally, a bias away from the median. 

More insight into stationary biases can be gained via a telescopic-series analysis. First, 
note that in the limit of infinitely fine design (s — > 0), the stationary biases of BCD and KR, 



47 



too, should vanish. Therefore, we may expect the bias magnitude to be related to s. In the 
subsequent discussion, we assume that there are no design boundaries (i.e., the treatment 
set is potentially infinite) that F is twice continuously differentiable in x, and also - for 
convenience - that Qp falls exactly on (lu* (letting Qp land midway between design levels 
is nearly as convenient) |§ One can then express the stationary mean as a telescopic series 
around target: 



i^TT = ^ T^ulu = Qp + j{'^u*+j - TTn* 



(3.7) 



Qp + STTu- ^ . 



u*+j-l u*-l 

n T.- n 7.-' 

v=u* v=u*—j 



Since the stationary frequencies diminish in a faster-than-geometric rate away from Qp, the 
sum is typically dominated by the first 1 — 3 summand pairs. The pairs are composed of 
opposite-sign terms, which may nearly cancel out. We now carry out a detailed calculation 
of the first pair for SU&D. 

First, note that if F has a symmetric density, for SU&D the terms in each pair will 
cancel each other exactly. So a part from des ign-offset biases, SU&D averaging estimation is 
indeed unbiased as claimed bv iDixonI (|l965l ). For asymmetric distributions, we would need 
standard Taylor expansions around target: 



1 



Fu*+i 
1 



F, 



fu* ^ V^'^«* ~ fu*^^ 



F\ 



+ 



Sfu* 



1 - F,* 



u*~l 



1 - F,, 



s' (2/2.+/:,(l-F,. 
+ 2(1-F,„)3 



(3.8) 



+ ..., 



where / is the threshold density and / its derivative. Substituting this and setting Fu 
0.5, the first summand pair then becomes 



7n* -7«.-l 



0.5 



2_4./,,+2sM4/,^.-/„ 



-0.5 



+ ... 
+ ■■■ 



(3.9) 



^Numerical runs (data not shown) seem to indicate that biases related to design offset w.r.t to target are 
much smaUer than the spacing-related bias discussed here. 
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The first-order terms cancel out, and we are left with a bias that is second-order in s with 
a sign opposite that of /'(Qo.s) (for unimodal or monotone-decreasing threshold densities, 
this means a bias in the direction of F's skew). It is quite straightforward to show that 
all summand pairs in ()3.7p for SU&D give similar negative second-order terms when thus 
expanded. Therefore, it appears that SU&D's stationary mean is a biased estimator of Qo.s 
for asymmetric densities. However, this bias can be kept very small under reasonably fine 
spacing. 

The same analysis for BCD yields 



7n* -lu'-l 



1 



Fu*+i 1 
2p-l 

p{l-p) 



and for KR 



7n* - 7u' 



P 



1 



1 



sfu* < 0, 



(3.10) 



(1 

+ l)p-l 



Fu* 



(3.11) 



p{l-p) 



-sfu* < 0. 



In both cases there is a downward (i.e., away from median) first-order bias. BCD's bias 
term is larger than same-targeted KR, by a factor of about 5/3 (the exact ratio as a function 
of k is found by substituting KR's target (j2.4p for p). GU&D(^ q^^) "inherits" the properties 
of SU&D, meaning that its bias is only second-order and therefore usually smaller than 
that of the other two methods. For symmetric or upward-skewed threshold distributions (a 
realistic assumption when dealing with positive thresholds), GU&D(fc o,i)'s bias is positive, 
i.e., towards the median. 

So we have seen that stationary biases are approximately first-order or second-order 
in the spacing, depending upon the design. All these finds are in line with numerical 
observations such as Fig. 12.31 and with the theory in Section 12. 2i Numerical results indicate 
that for all four variants, stationary biases are quite moderate except when using very coarse 
spacing. 



3.2.3 Averaging All Treatments or Only Reversals? 



The simplest candidate to replace the reversal average w is some average of all treat- 
ments (possibly start ing from a certain cutoff point); let us call this latter estimator v. 
Wetherill et al.l (|l966l )'s simulations showed w as having smaller MSE than v, and this is 
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why it was chosen. Since then, other simulations (e.g., iKershawl . Il985l ) have shown the 
opposite. Can we make a more theoretical comparison between the two options? 



Reversals as a Filter 

Assuming stationarity, the frequency of reversals at each level should also become stationary. 
However, the stationary distribution of reversals is not necessarily identical to vr. We can 
calculate it using tt and the balance equations: 

^ (rev.) ^ ' ' 

T.v^v 

^ T^u [{qu + ru)Fu +Pu{l- Fy)] 

E(rev.) 
v< 

_ T^u \pu + Fu- 2puFu] 

Elrev.) ' 
v^v 

where we implicitly assumed that 'up' and 'down' transitions occur only after a 'no' or 
a 'yes', respectively, and that staying at the same level (if possible) also follows a 'no'. 
Conveniently enough, the (relative) stationary reversal frequency at a given level is a linear 
function of vr at the same level. This means, that picking only reversals can be viewed as a 
secondary, linearly filtered sampling from vr. 

For SU&D, ([XT2]) translates to 



vr(r^^-) oc vr,, 



Fl + {l-Fyf , (3.13) 



where the normalizing proportionality factor has been omitted. So the filter is distorting: 
levels far from the median are over-sampled compared with those near the median, by a 
factor of up to 2 - which means that the stationary variance of reversals is larger than that of 
the entire chain. If the threshold density is exactly symmetric, the sampling does not induce 
a bias to the averaging estimator. But if it is asymmetric, the sampling induces a slight bias 
in direction of the distribution's skew - i.e., in the same direction as the stationary- mean 
bias. 

For BCD, vr^'^'^^-) is proportional to 

4''"^ oc T^u-^{1 - Fu){l - 2Fu) + Fu, (3.14) 
which reduces to (|3.13|) under F = 0.5. The sampling coefficient's minimum point, as a 
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function of F, is at 

AT _ "I 

pimin.) ^ (3.15) 

The minimum moves to the left of T as F decreases; it is positioned at zero for T = 0.25. 
Thus, for most or all of F's range, sampHng probabihty increases with F - meaning that 
BCD reversal averaging induces a bias towards the median. Here, too, we expect vr^'"'^^ -' to 
have larger variance, since the filtering is more aggressive near target. 
For KR we get 

oc v^ /"[^~^,^-^V^^^'^ ■ (3-16) 



Again, the formula reduces to (j3.13p for k = 1. KR sampling has the same qualitative 
properties as BCD (i.e., bias towards the median and a larger stationary variance), albeit 
milder. The filtered sampling coefficient has a minimum slightly to the left of target. 

Note that for BCD and KR, t&'s additional bias is in the opposite direction of ^^'s 
stationary- mean bias. This leads to the interesting question whether the two cancel out. 
First-order analysis indicates that the reversal bias may be of comparable magnitude, but 
is generally smaller than the stationary-mean bias. The exact interplay between the two 
would depend upon specific distribution and spacing conditions. 



Variance Comparison 

One way to generically describe the variance of a dependent-sample average is 

a, = (3.17) 

where JT-e// is an effective sample size, usually smaller than n. We have seen that under 
stationarity, (Ttt for reversals is larger than for all trials. What about ?^e//? This is a measure 
of the information content in the sample about the distribution of trials (or reversals). But 
reversals are a subset of the sample used for v; their information content cannot be larger. 
Therefore, Var{w) > Var{v). 

Instead of this intuitive and not quite rigorous approach, it is also possible to obtain the 
same result via a direct and more tedious approach for some of the designs. 

Proposition 3 For SU&D and G'f/fi/Z)(^. g^i) designs, under stationarity V ar (w) > Var{v). 



Proof We slice the chain into subchains, each beginning with a reversal and ending right 
before the next reversal. For each subchain indexed j, let aj be the subchain's treatment 
mean. A simple average of aj is equal to w which is known to be equivalent to w. Meanwhile, 
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V is an average of the a^'s, weighted by the subchain length nj. Which one has smaller 
variance? 

If shorter subchains have a higher probability of being farther away from /i^, the popu- 
lation mean of the a^-'s (which is equal to then definitely Var^w) > Var(v), since 
the latter down- weights them. Now, at a given level lu, the length of a subchain ending 
with the next reversal is a generalized geometric r.v. with increasing "success" probabilities 
("success" being a reversal). However, the subchain-terminating "success" probabilities are 
always larger in the direction away from Qp, which (ignoring boundary conditions) is very 
close to fia- Therefore, subchains are expected to be shorter, the farther their center is from 
fia, and hence Var (w) > Var (v). □ 

For BCD and KR, this approach encounters some difficulties. Now v is not a weighted 
average of o^'s, but somewhat smaller than that average because ascent subchains are 
"bottom-heavy" (recall that w is biased upwards w.r.t ^tt)- Subchains still tend to be 
shorter when they are oriented away from Qp, but because of the upward bias we are not 
quite sure how close Qp is to fia- However, since the biases (of fij^ w.r.t to Qp and of fia w.r.t 
//tt) are opposite in sign, there is little reason to expect a different relationship between the 
two variances for BCD and KR. 

3.2.4 Isotonic Regression and Bias 

As detailed in Section [3. H whenever isotonic regression (IR) departs from simple linear in- 
terpolation, it produces constant intervals. Besides being problematic for inverse estimation 
of quantiles, since we assumed F to be continuous strictly increasing this may generate bi- 
ases. Specifically, if F is as assumed, then any estimator z{t) producing a constant interval 
z = c Vt G [^1,^2] is either unbiased only at a single point on [ti,t2]! or is biased across 
the entire interval. Common sense suggests that if IR is based on unbiased estimates {Fm}, 
then the former case is far more likely. This can in fact be proven in some specific scenarios, 
such as normal errors with variance inversely proportional to the IR weights. 
The proof requires two preparatory lemmas. 

Lemma 2 Assume that at each treatment level lu we have a raw point estimate F^, which 
is normal with mean F^ and variance Vq/uu, and that all point estimates are mutually 
independent. Then for each sequence of points indexed u . . .u + M which are replaced by a 
single value in the PAV algorithm (hereafter: a violating interval), the pooled IR estimate 
is independent of the degree of the last violation in the sequence, i.e. 




(3.18) 
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where Fu-.u+M denotes a weighted average of F values from lu through lu+M- 
Proof Let Y = (Yi, Y2) be independent normal r.v.'s with variance 



V 



Vi 
V2 



Then Z , a Hnear transformation of y, Z = AY, wih be normal with variance Var[Z] = 

AVA'. Specifically, Z = (Y2 — Yi, + , with K some constant, will have a 

variance 

Therefore, Zi ± Z2. 

Now take a violating interval extending from 1^ to lu+M, and set Yi = -Fm:u+m-1)^2 = 
Fu+M- Then 

Vo 



Var { Fu+M 

while 



nu+M 



Var[Fu..u+M-.)= 2 E ^ ^u+. 



+M-1 



The two are independent, and 

Fu:u+M = K 



^u+M-l \ 

f^i) 1 Fu:u+M-1 + nu+MFu+M 



i.e., an average of the two with weights inversely proportional to variances. By the proof 
above, this average is independent of the difference Y2 — Yi, which in our case denotes the 
degree of last violation in the sequence. □ 

Lemma 3 Under the same assumptions, the weighted average Fu-.u+M is independent of 
the event that the points indexed u. . .u + M are part of a violating sequence starting at u. 

Proof By induction. Let the event in question be denoted violu-u+M- For M = 1, the 
identity follows directly from Lemma [21 

For M > 1, assume Fu-.u+M -i -L violu-.u+M-i, and prove for M. Due to the independence 
of the -F-s, trivially Fu-.u+M -L violu-.u+M~i- Now the event violu-.u+M = violu-.u+M-i H 
{Fu+M < Fu-.u+M~i}- But by Lemma [U Fu-.u+M -L {K+m < Fu-.u+M-i}- Therefore it is 
also independent of the intersection of these two events and the proof is complete. □ 
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These two lemmas show that weighted averages such as those produced by the PAV algo- 
rithm, are invariant to monotonicity violations (under the specified distributional assump- 
tions). This clears the way to a simple and useful fix to IR's flat-interval problem. 

Theorem 6 Under the same assumptions as above, for each violating interval from 1^ to 
lu+M, PAVA produces an estimate which is biased upward at lu and downward at lu+M from 
its unconditional expectations Fu,Fu+m, respectively. 



Proof From Lemma [3l 



E 



Fu:u+M I violu:u+M — E Fu-.u+M — Fu:u+M 



the weighted average of F over the violating design points. By the strict monotonicity of 

F, Fu < Fu:u+M < Fu+M- □ 

So, for these particular distribution assumptions on point estimates, we have "nailed" 
the IR estimate to be unbiased somewhere within the flat interval. It follows naturally, that 
percentile (inverse) estimates around the interval must be biased away from the interval to 
either direction. 

Of course, in a treatment-response case the binomial estimates {i^m} are not normal. 
However, the scenario covered by Theorem [6] approximates their asymptotic behavior as 
n — > oo, especially if the binomial variance factor F{1 — F) is changing slowly over the 
violating interval. 

Therefore, it may be useful to see where in the interval, according to that scenario, the 
IR estimator is unbiased. The answer is quite simple: 

Theorem 7 Under the same assumptions as above, and if F is twice continuously differ- 
entiable in x, then Fu-.u+M ~ ihe PAVA estimate for a violating interval from lu to lu+M ~ 
is unbiased to (Taylor) first order in segment length lu+M — lu, only at the weighted-average 
point 

u+M 

~lu:u+M = ^ i^vlv, where ujy = ^^^z — • {'i.ld) 

V = U 2—jj = U 

Proof First, from F's continuity, strict monotonicity and Theorem [6] we know that there is 
exactly one point at which Fu-.u+M is the true value of F, and that this point is in {Iu,Iu+m)- 
Now Fu-u+m can be written as a first-order Taylor expansion from a single reference 
point X* £ (IuJu+m)- 



Fu-.u+M = F{x*) + E:tu^vF'{x*){lu - X*) + 1 E:tu^vF"imv - X*)' 

F{X*) + F'{X*) {1u:u+M - X*) + ^Z:tu ^vF"{^u){lv - 
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where {^v} are points between x* and ltj,v = u . . . u + M. At x* = lu-.u+M, the first-order 
term vanishes, and therefore to first order, 

Clearly, this is also the only location in the interval that removes the first-order term. □ 
This result will be used in the next section to improve upon IR. 
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3.3 New Methodologies 

3.3.1 Averaging Estimators 
Overview 

Averaging estimators are not just commonly used in U&D; surprisingly or not, their perfor- 
mance is also pretty hard to beat under the popular SU&D and KR designs. The conver- 
gence of Markov chains implies x„ — > /Xtt, the stationary mean, with asymptotically normal 
behavior and without any further parametric assumptions on F. Averaging estimators 
typically have the smallest variance, and more often than not it is the variance term that 
dominates estimation performance. Averaging estimators' major drawback, as the previous 
section indicated, has to do with biases. Here is a summary of potential averaging estimators 
biases: 



1. Bias towards the starting point xi. 

2. Bias away from the closest boundary. 



3. Theoretical bias of /x^ itself, as exposed in Section 2.2. 

4. If a reversal-only average is used, an additional reversal-sampling bias. 

Since there are so many biases in potentially different directions, sometimes they neu- 
tralize each other to a certain extent. This is why the reversal-only average w occasionally 
outperforms the all-treatment average v in numerical simulations. This phenomenon will 
be observed numerically later on, in Section 13.41 

Items 2 and 3 on this list can be managed by choosing a sufficiently fine spacing (which 
unfortunately increases the magnitude of item 1) - and by not imposing unnecessary bound- 
aries. Additionally, for KR designs there is a potentially interesting fix to item 3 (stationary 
bias). Recall from Section [2.21 that KR's sub-chain of base states (i.e., picking only the first 
trial of each visit to a treatment level) has an identical stationary distribution to that of 
GU&:D(^, 0,1). This means that averaging only over the zero-state sub-chain may help us get 
rid of KR's first-order stationary bias. 

for general designs, the first item on the list - the starting-point effect - has received 
the most attention in devising new estimators. 

Let Wr b e the average of all reversal points from reversal r onward; according to 



Garcia-Perej (jl998l ). this is the most popular estimator in vision research for KR designs. 



Our analysis up to now indicates that it should be outperformed by Vr, the corresponding 
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all-treatment average starting from the same reversal point. How should r be chosen? This 
is a classic bias-variance dilemma: the farther out we push the cutoff point, the starting- 
point bias w.r.t /i^r decreases, but the variance increases. 

Or perhaps there is another way out? Are reversals a good cutoff choice at all? The 
previous section showed that under stationarity reversals have a larger variance than a 
randomly-chosen treatment. This means their expected distance from /i^r is larger. Since 
stationarity is in fact a best-case assumption, it seems that reversals are not a very good 
choice for cutoff. 

A New Adaptive Estimator 

Another approach seeks to adaptively identify the length of the initial transition phase. The 
underlying assumption is that the latter phase of the chain resembles a sample from vr. If 
so, the average of the chain's "tail end" is close to ^,r. Meanwhile, the "head" or starting 
point is arbitrarily fixed to xi0 The auto-detect estimator {jad chops off this biased 
"head", exactly at the first point in which the chain traverses to the other side of the "tail 
end" average, compared with the starting point - and then proceeds to average the rest of 
the chain. Symbolically this estimator can be described as 

VAD = Xc(n):n = ""'f"^} , , where c(n) = min {i : sgn (xj - ^ sgn (xi - X2;n)}-1- 

^ ' n — c[n) + 1 i<i<n 

(3.21) 

The location of the cutoff point is illustrated in Fig. 13.21 

Sometimes, the "tail end" of the chain is an "unlucky draw" containing an excursion 
far from target; sometimes the entire chain converges more slowly than expected by the 
experiment's designers. The auto-detect method would usually identify such cases by placing 
the transition point relatively late. From an overall performance perspective, using a late 
auto-detect point is not advisable; the probability that the auto-detect scheme has been 
fooled by an excursion - conditional upon the value of c(n) - increases sharply as c(n) 
approaches n. one should therefore set a maximum cutoff c{n)crit., so that if c(n) > c{n)crit. 
we use c{n)crit. as our cutoff. This critical value should definitely be less than n/2. Since it 
can be argued that the "tail end" needs to be sufficiently longer than the transition phase 
in order to serve as a reference, c{n)crit. values around n/4 or n/3 seem reasonable. Thus, 
the auto-detect estimator would be guaranteed to include at least the final two-thirds or 

similar situation is encountered in MCMC simulations, where a variety of diagnostics have been 
developed to determine stationarity. However, MCMC usually involves a very large sample and the ability 
to run several parallel instances of the same run - privileges not available in the small-sample percentile- 
finding application. 
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three-quarters of the treatment chain. 

If one wishes to be even more efficient, this critical value can be compared with the 
number of trials required to reach the boundaries from xi (if there are no physical bound- 
aries, visualize "virtual boundaries" around the range in which you realistically expect Qp 
to fall). Allow for one "switchback" of falling back one level towards xi before resuming 
progression. If such boundary-related predictions can be made, then resulting cutoff should 
be compared with n/3 or n/4, and the smaller of the two chosen for c{n)crit. For example, 
if the design is KR with k = 2 and the boundaries extend 5 levels in each direction, then 
one needs 5 -|- (A; + 1) = 8 trials to reach the bottom end including one "switchback", but 
5A; + (A; -|- 1) = 13 trials to reach the top end. Then, if n > 52, one cdbH use c(^Ti)crit. 

= 13 

instead of n/3 or n/4, and improve precision§| 

Conversely, if sample size is flexible, the auto-detect properties can be used to formulate 
a stopping rule. The experiment should continue until c(n) < n/3, or until n — c{n) passes a 
certain value - in both cases promising a sufficient sample size of approximately stationary 
sampling. 

For KR designs, we may also wish to look at vad,o ~ the auto-detect estimator taken only 
on the zero-state sub-chain. It promises to mitigate both the starting-point and stationary- 
mean biases simultaneously. This idea has not been explored in detail in this thesis, and 
may merit further examination. 

A Geometrically Weighted Estimator 

Instead of trying to locate a cutoff point, we can weight the chain according to convergence 
theory. In Section 2.2, we saw that 



where i^n+i = E and A# is the second-largest eigenvalue of P. This justifies the 

following weighted-average estimate: 



The weights should be a function that starts near zero, then increases until some transition 
point where we suspect the bias term becomes small compared to the variance term. The 
location of this point can be estimated by comparing the standard error of some raw form 

^Clearly, this indicates that for below-median targets xi should be positioned above the middle of the 
"effective range" ; see Section 13.61 



A' 



n 



(3.22) 




(3.23) 
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Auto-Detect Averaging 

o _ 



00 - 




Trial 

Figure 3.2: Illustration of the auto-detect estimator. The chain first crosses the "tail- 
average" line on trial 9, meaning that trial xg is the first one smaller than the mean of 
remaining trials (marked with a horizontal dashed line), while xi . . . were all larger than 
the analogous tail means. The auto-detect estimate vad is then the average of xg . . .X41. 
Trials 8 and 9 are circled. The chain used for illustration is the KR run from Fig. 12.11 but in 
principle this method could be applied to any sequential experiment where a starting-point 
effect is suspected, including non-U&D ones. 
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of V with some order-of-magnitude estimate for starting-point bias. 

Of course, we do not know P and its eigenvalues; however, P can be nonparametricahy 
estimated from F. Qgw provides a nice complement to vad'- the former uses responses to 
evaluate convergence, while the latter tracks the treatment chain. 

Two implementation notes: 

• The -F-derived P tends to produce optimistic rates, or even no rates (in case there is 
only one point estimate of F falling in (0, 1)). It is therefore recommended to weight 
F with a prior estimate, say uniform. A Bayesian saturated model, or any other 
parametric model, are also an option (see Section 4.1). However, there is no reason 
to 'overkill' this estimate of P with sophisticated procedures, as it is only used as an 
aid to determine weights. 

• Another implementation option is to multiply the decay factor by an 'acceleration' 
factor somewhat greater than 1. This ensures that the initial run's weight is indeed 
smaller than its impact on bias. 

At bottom line, Qgw has many more "moving parts" than other estimators. Since nu- 
merical trials did not indicate an overall performance improvement over the simpler options 
described above, it will not be discussed again below. 



Mitigating the Boundary Effect? 

If Qp is less than 2s away from a hard boundary, bias source 2 on the list at the beginning of 
the section can overwhelm all the rest. The best fix is prevention: not to place a boundary, 
unless it is physically impossible to continue moving up or down. If treatments are positive, 
consider placing {/m} on a log-transformed scale (as is the default in sensory studies). If 
they are proportions, consider a log-ratio transform. This is not as cumbersome as it sounds: 
phrase the treatment levels as u parts component B to one part component A, etc. Yet 
another alternative is to halve the spacing if a boundary seems to play a dominant role; 
a trigger for halving could be the 2nd visit to the same boundary, or the first transition 
decision violates the boundaries. Of course, this would have an impact only if the halving 
occurs relatively early in the experiment. 

Sometimes none of these fixes are available, and we are stuck with a real physical bound- 
ary. One existing fix was described in Section 13.11 (|Garcia-Perea . 1 19981 ) - continue adminis- 
tering treatments on the boundary, as a proxy for levels beyond it. This fix has the property 
of over-assigning treatments on the boundary. An alternative, milder fix would be to impute 
{xn} following the experiment by adding virtual treatments at Im+i wherever a move 'up' 
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was mandated from Im, and at /q wherever a move 'down' from h was mandated (these 
virtual 'treatments' would of course never be administered, only added to the average), 
the number of treatments added would follow the transition rule (e.g., for KR, k virtual 
treatments would be added each visit to Iq, but only 1 at Im)- This imputation is equivalent 
to assuming F = 0, 1 at the two boundaries. The imputation fix requires no change in 
experiment planning. 

It turns out, again, that neither of these fixes works as well as one might expect. This is 
because both increase the averaging-estimator variance - an increase which is not necessarily 
offset by a substantial bias reduction. It seems that, if the chain runs dangerously close to a 
boundary, one should turn to response-based estimation - which, coincidentally, is our next 
topic. 

3.3.2 Centered Isotonic Regression 

Due to the theoretical results on IR bias, culminating in Theorem [71 we may attempt 
to modify isotonic regression by replacing each constant interval with a single point at 
the location prescribed by (j3.20p . To be coherent, the new estimator would perform the 
same action at intervals that were originally constant as well. Algorithm 2 describes this 
estimator, dubbed "Centered Isotonic Regression" (CIR). 

Algorithm 2 Centered Isotonic Regression (CIR) for Treatment-Response Designs 
procedure ClR{{lm},{zm},{'^m}) 

while a = {u : 1 < u < m, Zu > Zu+i} ^ do 
V <— min(H) 
M ^ 1 

while > Zy+M do 

M ^ M + l 
end while 

7 7 _ T I \-^v+M-l 

Iv ^ I'V.v+M —1 — 2-^u=v '^u'-ul 2-^u=v 

Remove points {{ly+i, Zv+i,n^+i) . . . {I^+m-i, Zv+M-i,nv+M^i) 
end while 

Return {{Im, Zm,nm)}. 
end procedure 



CIR's point-estimate output is used to estimate F and its quantiles via linear interpo- 
lation, including at original design points which have been removed. The following obser- 
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vations may help clarify some more major similarities and differences between PAVA and 
CIR. 

1. CIR produces the same set of unique z values (point estimates) as PAVA, but under 
CIR each value appears only once. 

2. Unlike PAVA, CIR requires x values as inputs, and modifies them in the same manner 
as the z values. As a result, each unique weighted-average z value is assigned to the 
corresponding weighted-average x value §1 

3. CIR identifies ties, too, as violations since they violate strict monotonicity. 

4. In the absence of strict monotonicity violations, the two algorithms return the original 
data. 

5. In the presence of strict monotonicity violations, CIR returns a smaller number of 
points than the input. 

Due to the last property, sometimes a boundary point may be removed by CIR. Suppose 
w.l.o.g. that the original l\ has been removed, and now some > l\ is the smallest x value 
for which a CIR estimate exists. Then complementing the CIR output with a constant 
interval on is equivalent to the IR output on the same interval, and should be 

regarded as the default. However, in case prior knowledge exists regarding F on or outside 
the boundaries (e.g., F(0) = 0), then it should be incorporated either into CIR's input 
(when an appropriate weight can be assigned) or into its output. 

Conceptually, the CIR process is analogous to saying: "We have this sequence of point 
estimates, each perfectly valid on its own, but as a sequence they are suspect due to mono- 
tonicity violations. We can pool them together to get a single, more believable estimate; 
however, this is still a point estimate only, and deserves to be placed at a single point rather 
than occupy a complete interval." 

Under the normal-error scenario described in Section 13.21 and some additional assump- 
tions, it can be proven that CIR yields a smaller overall estimation MSE than IR, but the 
calculation is rather cumbersome and uninteresting. In nearly all numerical U&D simu- 
lations I have carried out, CIR clearly outperforms IR for quantile estimation (some data 
will be presented below in Section 13. 4p . The only case when CIR is not recommended, 

^The point placement prescribed by (I3.20p is optimal for the normal-error scenario and seems to work 
well for the treatment-response case as well. For other scenarios (e.g. a current-status dataset copmosed 
of I's and O's only), the formula for the optimal point may be different. 
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is when the researchers suspect F may resemble a staircase function. In that case, IR's 
piecewise-constant output is obviously more appropriate then CIR's linear smoothing. 

Table 3.2: Summary tables illustrating how CIR works in practice on treatment-response 
data. Data are identical to that shown on Table 13. 1[ 



Treatment Raw Input CIR Output Raw Input CIR Output 





Yes 


No 


F 


Yes 


No 


X 


F 


YesNo 


F 


Yes 


No 


X 


F 


0.17 





4 


0.00 
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0.17 


0.00 


1 7 


0.13 


1 


7 


0.17 


0.13 


0.33 


3 


9 


0.25 


3 


9 


0.33 


0.25 


4 8 


0.33 


6 


14 


0.40 


0.30 












0.50 


3 


7 


0.30 




10 


0.55 


0.28 


2 6 


0.25 






4 










0.67 


1 


3 


0.25 




4 


1.00 


4 





0.67 


1.00 


0.83 


1 


1 


0.50 


1 


1 


0.83 


0.50 















(Simulation Run 14) (Simulation Run 9) 



We now revisit the two data examples from Table 13.11 and Fig. 13.11 Table 13.21 and 
Fig. 13.31 show the same data, estimated via CIR. Since CIR's output is strictly monotone 
(at least between its output point estimates), the quantile estimate for Run 9 (RHS) is now 
unique. Note that CIR is not a risk-free fix. For example, in Run 9 the CIR estimate of 
F at /s is 0.56, while -F3 was only 0.25. This is because the new estimate is a weighted 
average of F2,F3 and F4 - the latter being equal to 1. From a least-squares perspective 
(i.e., least-square distance from the original point estimates), CIR is clearly not better than 
IR. However, with respect to the true F the pulling-in of neighboring F values to smooth 
out the estimate in violating intervals (where point estimates are more suspect) seems to 
pay off. 

A Combined Fstimator? 

Averaging estimators and CIR appear to complement each other: the former typically have 
smaller variance, and the latter smaller bias. This suggests the simple fix of using their 
average as a combined estimator. Unfortunately this fix is not successful, because the two 
estimator types are highly correlated (numerical ensemble correlations are typically around 
0.8 or more). This means that a simple average of CIR and an averaging estimator is 
likely to have higher variance than the averaging estimator alone. Since estimation error is 
dominated by the variance term (often over 90% of the MSE is due to variance, see below 
in Section [3.4|) . the improvement in bias may be insufficient to offset the variance hit. 

Why are averaging estimators and CIR so correlated? Suppose w.l.o.g. that we observe 
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Centered Isotonic Regression: Run 14 Centered Isotonic Regression: Run 9 




0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

Treatment Treatment 



Figure 3.3: Graphs illustrating how CIR works in practice on treatment-response data. 
Data are the same shown on Table 13.21 Raw F values are in 'X' marks, and the CIR 
output values are in circles connected by solid lines. The dashed red lines indicate the 
inverse-interpolation process. Compare with Fig. 13.11 



a sequence of above-average thresholds. The response rate at most treatments will be below 
the population expected values, pushing F downward and CIR inverse estimates upward. 
At the same time, the negative responses will also push the treatment chain itself upward, 
and with it the averaging estimators. 



3.3.3 Interval Estimation 

Confidence Intervals for Averaging Estimators 

For averaging estimators, our rationale is that we sample from vr and try to estimate the 
standard error of the running sample mean. Trials before the cutoff point (be it a reversal 
or the auto-detect point) are discarded. A simple estimate for the standard error could be 
of the form 

S.E.i = (3.24) 

where S is the sample standard deviation of the sub-chain used for averaging, and ng// is 
an estimate of the effective sample size - which should be smaller than the sub-chain le ngth 
since the sampling is dependent. We can estimate n^ff using the approach offered by 



Choi 
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(|l99Cll ). again with some convenient simplifications. We assume an AR(1) structure. The 
variance of an averaging estimator v is then 



h'^Var{v) 



n 



n 



2p 



n — 1 

El 



4=1 



n 



2p 



[1 



hVar{v) 



1 + P 



(3.25) 



(3.26) 



ai 1- p 

where p is the first-order autocorrelation and n is the chain length used for averaging. The 
approximation in (|3.26p is conservative. So our autocorrelation-based sample size estimate 
is fieff = n(l — /5)/(l -|- p). Since the variance is estimated from the data, we should use a 
t-multiplier with ng// — 1 d.f. (or 1 d.f. if fie// < 2). 



A rather different approac h follows iTsutakawal (Il967bl )'s observation that intervals be- 



tween hitting times are i.i.d. ITsutakawal (|l967b[ i used the complicated formula (|3.5p to 



directly estimate the variance; however, as mentioned earlier this appears to be overly op- 
timistic. Instead, we estimate v^s variance from first principles. Let Nj be the length of 
hitting-time interval j, and let Aj be the treatment mean over the interval (Ideally, we use 
hitting times at the level closest to target, /„*)• Then each r.v. is i.i.d. across intervals!^ 
Now, neglecting possible "leftovers" before the first interval and after the last, averaging 
estimators can be expressed as 



n 



where n, as before, is the sample size used for averaging (i.e., from the cutoff point onward). 
Neglecting the randomness of n0 we focus on the numerator. For a single interval, 

Var{AjNj) w E [Var{AjNj\Nj)] + Var {E[AjNj\Nj]) (3.27) 

)] +Var{Nj)max{pT,,lu*)'^, 

where cj^ is the stationary variance and lu* is the level used for determining the intervals. 
Summing over all intervals, normalizing and replacing population parameters by estimates. 



We employ a convention that avoids gaps or overlaps, e.g. that each interval begins with the treatment 
following the hitting time. Additionally, for KR each sub-state should be viewed as a separate state. 
Hence, we look only at the first visit to each level (i.e., at visits to lu,o only), since those are the most 
numerous. 

^^For fixed-sample applications this is not a bad assumption. Of course, an approach using hitting-times 
or reversals (which are related to them) as stopping rules will have to account for h's variability. 
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we obtain the estimate 



S.E.i, 



h 



(3.28) 



where is the level closest to the point estimate, at which the hitting-times are counted; 
h is the number of hitting-time intervals, is the sample variance of the chain used for 
averaging, and n| and 5"^ are the sample mean-square and variance of interval lengths, 
respectively. As above, we use a t-multiplier with h — 1 d.f. (or 1 d.f. if /i < 2). 

Based on bot h theoretical appeal and numerical trials, I chose the second, hitting-time- 
based S.E. estimate (|3.28p . In general it is quite conservative. One reason for choosing 
the more conservative option, is that v estimates the stationary mean and not Qp, and 
therefore the interval will inevitably be somewhat shifted. Sometimes, when the shift is too 
large (i.e., one of the bias sources mentioned at the top of this section is too dominant), no 
interval would do the job. This again points to the importance of averaging-estimator bias 
reduction. 



Confidence Intervals for CIR 

CIR's intervals (and also IR's) can also be approximated using a simpler and more direct 
approach than the bootstrap. Consider the CIR interpolated quantile estimate Qp. Sup- 
pose w.l.o.g. that the it falls between the CIR-output treatment values lu and lu+i- The 
estimation variance can be approximated by 



Now, the forward estimate of F is simply a weighted average of binomial point estimates. 
Obviously 



Nearly always, the level with higher variance is the one with a smaller sample size, assume 
w.l.o.g. it is /„. Recalling that the {n^} are random. 




(3.29) 




Var{F^) 



E \^/ar{Fu\nu)\ + Var 
Fu{l-Fu)E — . 



) 



(3.30) 
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Note that the second term on the RHS is zero because -E[-Fm] remains (at least to first order) 
Fu regardless of sample size. Estimating the expectation of is generally infeasible 

without knowledge of F, but due to Jensen's inequality we can see that the variance is 
always larger than in the fixed-assignment case, as it should be. Note that this expression 
still does not account for the total variance in the CIR inverse estimate, since different runs 
may lead the estimate to fall in different intervals, i.e., different values of u. 

In any case, a simple way to continue from here is by using binomial quantiles for F^. 
We convert the standard piecewise-constant binomial quantile function, whose properties 
are strongly dependent upon the specific (and random) value of nn, into a strictly increasing 
q{nu,p), which is the linear interpolation between the exact-probability points {{pz, z)}, z = 
0, . . . + 1, where pz are defined via Pr{Z < z) = pz- This piecewise-linear quantile curve 
is then re-centered around the expectation np for symmetry. One way to then account for 
the inflated variance of (j3.30p . is to offset the curve by 1 to each direction away from UuP- 
Finally, divide by n„ to normalize. 

The resulting curve is always more conservative than standard binomial quantiles (see 
Fig. 13. 4p : however, at extreme percentiles it still suffers from the limitations of the binomial 
quantile function, and does not increase fast enough, considering that the experiment's 
sequential nature may generate heavy-tailed sampling distributions. Numerical trials show 
that the linearized function q does not provide enough differentiation between the 95th 
percentile and 97.5th percentiles, and therefore 90% CFs are overly conservative or 95% 
CPs lack in coverage (or both). 

Recall that rare-event binomial probabilities can be approximated by Poisson proba- 
bilities with rate riuP, and note that the Poisson variance is always greater than the ap- 
proximated binomial's variance. Therefore, using Poisson for the binomial tail is generally 
conservative for our purposes. We employ the same linearization method as above in order 
to avoid step- function artifacts, and additionally symmetrize the quantile curve by using a 
mirror image of the higher-probability tail in lieu of the less conservative lower-probability 
tail. 

The binomial-derived and Poisson-derived functions are compared on Fig. \'SA\ for two 
different small-sample scenarios. Even though the Poisson curve may be less conservative 
than the standard binomial for some middle percentiles, and is generally less conservative 
than the linearized binomial for all but the most extreme percentiles, it does provide a 
better (and more realistic) distinction between percentiles on the tail. If negative values for 
binomial or Poisson quantiles (as happens on the lower tail) seem counter-intuitive, recall 
that we are not really sure between which two treatment levels the true quantile falls. 
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n = 6, F = 0.5 n = 9, F = 0.2 




15 30 45 60 75 90 15 30 45 60 75 90 

Probability (Percent) Probability (Percent) 

Figure 3.4: Illustration of possible quantile functions used for CIR interval estimation. 
Shown are scenarios with riu = 6,p = 0.5 (left) and riu = 9,p = 0.2 (right). Depicted 
are standard binomial quantiles (black line), the linearized binomial (black 'x' marks) and 
linearized symmetrized Poisson (red line). The latter two are described in the text. 
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3.3-4 Down-Shift Schemes Revisited 

In the previous chapter I mentioned the popularity of designs for which transition rules 
change at some point, most commonly the first reversal. In particular, down-shift schemes 
cut the spacing by half at the first reversal. The analysis presented since then indicate that 
a reversal, and surely the first reversal, does not possess good properties for a transition 
point. They are quite likely to be far removed from target. Numerical experimentation 
confirms this: while under some conditions a reversal-based down-shift scheme provides 
for estimation performance improvement, overall it is less robust. Using the third reversal 
improves robustness, compared with the first reversal, though it does not solve the issue of 
reversal variance under stationarity. 

The hitting-time analysis performed earlier suggests another direction. Clearly, levels 
near target are hit more frequently (in probability). Therefore, one can use trial Xh^ as a 
transition point, where hj is the index where the most frequently-hit level is visited for the 
j-th time. The natural choice for j appears to be j = 3: a smaller value does not preclude a 
possible early excursion, and a larger value pushes the transition too far into the experiment. 
Numerical runs indicate that this scheme is roughly equivalent to using the third reversal in 
a reversal-based scheme; both provide overall performance (as measured via empirical MSE) 
somewhere between that of the original and the halved spacing. Due to space limitations, 
down-shift schemes are not part of the extensive numerical study presented in the next 
section. 
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3.4 Numerical Study 

3.4-1 Simple Comparisons 
IR and CIR 

The theoretical results about IR and CIR in Section 13.21 were for a related limit case of 
normal errors; however, the logic underlying them should hold for a wider array of scenarios 
and hence we may expect CIR to perform better. Even though CIR is a bias-reducing 
correction on an individual run basis, in ensemble-level simulation analysis the benefit should 
manifest itself in estimation variance reduction. For different runs, IR's flat intervals may 
occur to the right or left of target, throwing the inverse estimate off either upward or 
downward of target, respectively. Over a large ensemble, the individual-run offsets may 
balance each other out - but would inflate ensemble variance instead. 

This is indeed what Tabl e 13.31 and Fig. 13.51 show. The empirical efficiency gains of CIR 
over IR are quite substantial^ In general, they increase as s increases (data not shown). 
For exponential thresholds one can discern an average upward shift associated with CIR 
(right-most frame in Fig. 13. 5p , and for logistic thresholds a downward shift (middle frame) . 
This is because, F being concave around target in the exponential case and convex in the 
logistic case, the forward IR/CIR would under /over-estimate F, respectively - resulting in 
an inverse (quantile-estimation) bias in the opposite direction. IR's flat intervals mitigate 
this bias, while CIR's interpolation increases it. However, this bias increase is more than 
compensated by the variance reduction. 

Reversal- only and All-trial Averaging 

For averaging estimators, we expect reversal-averages w to be shifted upward and have 
larger variance than otherwise-identical all-trial averages v. This is indeed demonstrated 
in Table [37^ and Fig. 13. 6i In one scenario, w^s upward bias offsets some of the downward 
starting-point bias, yielding roughly equal performance for the two estimators. An impor- 
tant detail to note is that, in spite of averaging-estimators' plethora of bias sources, it is 
still the variance term that dominates performance. 

BCD , KR, CU&D^k,o,i) 

I end this simulation hors d'oeuvre with some cursory method comparisons. From my 
theoretical work, one would expect KR to show a clear advantage over BCD for averaging 



Keep in mind that the two estimate differ only over about haff the runs; if one compares MSE only over 
runs producing monotonicity-violations, the efficienty gain is about double that shown on the table. 
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Table 3.3: Comparison of isotonic regression (IR) and centered isotonic regression (CIR), 
from KR (fc = 2) simulations over 3 distributional conditions and 2 sample sizes. The 'Vio- 
lation Frequency' column indicates what proportion of runs (out of 2000) had monotonicity 
violations in a region that affects estimation - meaning that for the remaining runs the IR 
and CIR estimates were identical. E.E.R. is the 'Empirical Efficiency Ratio': the ensemble 
MSE of IR estimates w.r.t the true Qp, divided by the MSE of CIR estimates. All statistics 
were taken over the entire ensemble of 2000 runs. Bias and standard deviation are in the 
same arbitrary normalized units of treatments. Spacing was 0.1, meaning that m = 10. 



Distribution 


Sample 


Violation 


Bias 




Std. 


Dev. 


E.E.R. 


Size 


Frequency 


IR 


CIR 


IR 


CIR 


'Nice' 


20 


50.8 


% 


-0.003 


0.008 


0.091 


0.081 


1.24 


Logistic 


40 


50.6 


% 


-0.004 - 


-0.001 


0.071 


0.062 


1.30 


'Upward' 


20 


59.0 


% 


-0.015 - 


-0.035 


0.110 


0.097 


1.15 


Logistic 


40 


59.8 


% 


-0.005 - 


-0.011 


0.086 


0.076 


1.25 


Exponential 


20 
40 


42.5 
46.3 


% 
% 


0.039 
0.022 


0.045 
0.030 


0.152 
0.132 


0.132 
0.113 


1.28 
1.32 



Table 3.4: Comparison of reversal-averaging from reversal 1 (wi) and averages of all treat- 
ment from reversal 1 (vi), from KR (k = 2) simulations over 3 distributional conditions and 
2 sample sizes. Details are as in Table [331 



Distribution 


Sample 


Bias 




Std. 


Dev. 


E.E.R. 


Size 




Vl 




Vl 


'Nice' 


20 


0.036 


0.022 


0.068 


0.066 


1.24 


Logistic 


40 


0.019 


0.009 


0.052 


0.049 


1.20 


'Upward' 


20 


-0.069 - 


-0.071 


0.099 


0.093 


1.06 


Logistic 


40 


-0.037 - 


-0.044 


0.070 


0.068 


0.96 


Exponential 


20 

40 


0.057 
0.041 


0.047 
0.033 


0.112 
0.097 


0.111 
0.095 


1.09 
1.10 



estimators, but also for CIR due to its sharper stationary distribution. As Table [331 shows . 
the averaging-estimator gap is more pronounced (25% to 40% difference in MSE); the CIR 
performance gap between designs is much less dramatic, but is still observed across the 
board (including in many numerical runs not shown here). 
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'Nice' Logistic, n = 40 



'Upward' Logistic, n = 40 



Exponential, n = 40 




IR 



CIR 




Estimator 



Estimator 



Estimator 



Figure 3.5: Comparison of isotonic regression (IR) and centered isotonic regression (CIR) 
under the 3 scenarios summarized in Table 13.31 and a sample size of 40. Boxplots reflect 
only those runs in which the two methods yielded different estimates. In all plots, IR is on 
the left. Targets are marked by solid horizontal lines. 



The GU&D(,!j o,i) vs. KR comparison is less clear-cut (Table [3^ . KR does have an 
edge with averaging estimators: in spite of GU&D's smaller stationary bias, the GU&D 
experiment is composed of a small number of same-treatment cohorts - causing a "grainy" 
averaging estimator with increased variance. And again, it is variance that dominates 
performance. Using CIR, the two designs run neck-and-neck with GU&D having a slight 
edge in these particular scenarios. 

The botto m line from these comparisons (and others, not shown here but reported in 



earlier work: 



OronI (j2005l )). is that KR appears preferable over BCD - especially if we 



can make averaging estimators work. Between GU&D^;, q and KR, the decision would 
probably depend upon whether a cohort design is required, desired or feasible - since the 
performance differences are small. Also, clearly for BCD and GU&D, CIR appears to be 
the preferred estimator. 
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'Nice' Logistic, n = 40 'Upward' Logistic, n = 40 Exponential, n = 40 




1 1 1 1 1 1 

Reversals All Reversals All Reversals All 

Trials Used in Estimator Trials Used in Estimator Trials Used in Estimator 



Figure 3.6: Comparison of reversal-averaging from reversal 1 (wi) and averages of all treat- 
ment from reversal 1 (vi), from KR (k = 2) simulations over the 3 scenarios summarized in 
Table [3741 and a sample size of 40. In all plots, reversal-only averaging is on the left. 



3.4- S Detailed Estimator Study 

We can be reasonably confident that CIR performs better than IR, that the KR design 
has an edge over similarly-targeted BCD, and that reversal-averaging tends to shift the 
estimate upward and has a larger variance than all-trial-averaging. This, however, leaves 
many design decision questions unanswered. 

Many factors affect estimation performance - spacing, the starting-point offset xi — Qp, 
the distribution's shape and scale, the location of boundaries, the sample size and so forth. 
I have chosen to demonstrate the sensitivity of various estimators to the first two factors 
(spacing and the starting-point effect), by neutralizing all others or holding them constant. 

In order to make the results intuitively meaningful and more general, both s and xi — 
Qp, as well as estimation bias and standard deviation, were normalized by F's standard 
deviation. The following series of figures shows 2-D contour maps of estimator performance. 
The comparison is between vi (top left), wi (top right), vad (bottom left) and CIR (bottom 
right). 
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Table 3.5: Comparison of estimator performance from BCD (F = 0.293) and KR (k = 2) 
simulations. The two estimators compared are -Oi (top), and CIR (bottom). The E.E.R. 
column quantifies KR's performance edge over BCD. Details are as in Table | 



Estimator 


Distribution 


Sample 


Bias 




Std. 


Dev. 


E.E.R. 




BCD 


KR 


BCD 


KR 




'Nice' 


20 


0.031 


0.022 


0.074 


0.066 


1.34 




Logistic 


40 


0.011 


0.009 


0.056 


0.049 


1.30 




'Upward' 


20 


-0.084 - 


-0.071 


0.107 


0.093 


1.36 




Logistic 


40 


-0.055 - 


-0.044 


0.079 


0.068 


1.41 




Exponential 


20 
40 


0.058 
0.041 


0.047 
0.033 


0.123 
0.107 


0.111 
0.095 


1.27 
1.31 




'Nice' 


20 


0.009 


0.008 


0.081 


0.080 


1.01 




Logistic 


40 


-0.003 - 


-0.001 


0.063 


0.062 


1.03 


CIR 


'Upward' 


20 


-0.047 - 


-0.035 


0.098 


0.097 


1.11 




Logistic 


40 


-0.017 - 


-0.011 


0.077 


0.076 


1.05 




Exponential 


20 
40 


0.050 
0.032 


0.045 
0.030 


0.134 
0.115 


0.132 
0.113 


1.05 
1.04 



SU&D, Normal Thresholds 

We begin from the most favorable U&D scenario, for which the method was originally 
designed. The normal scenario is favorable for several reasons: 

• / is symmetric, and hence for SU&D the stationary bias of n-,^ is negligible; 

• The curvature of F around target is very mild, and therefore CIR is also expected to 
have very little bias; 

• / is thin-tailed, and therefore within-run excursions are less common. 

These advantages are borne out in Fig. 13.71 mapping estimation bias for n = 300 
Contours are color-coded to enhance interpretation: in general, "green is good" - i.e., a 
small bias magnitude. Blue indicates negative bias and yellow positive bias. 



^^AU scenarios illustrated in Figs. 13.71 through 13.151 were also mapped for half the sample size, but those 
maps were omitted here for brevity. 
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Table 3.6: Comparison of estimator performance from GU&D(2 0,1)) and KR {k = 2) sim- 
ulations. The two estimators compared are vad (top), and CIR (bottom). The E.E.R. 
column quantifies KR's performance edge over GU&D. Details are as in Table [331 



Estimator Distribution 


Sample 


Bias 




Std. 


Dev. 


E.E.R. 


oize 


GU&D 


KR 


GU&D 


KR 


Lognormal 


18 


-0.028 - 


-0.045 


0.132 


0.124 


1.05 


32 


0.007 - 


-0.007 


0.115 


0.106 


1.16 


VAD Normal 


18 
32 


-0.006 - 
0.008 - 


-0.027 
-0.013 


0.112 
0.094 


0.102 
0.085 


1.12 
1.20 


Gamma 


18 


0.001 - 


-0.023 


0.103 


0.099 


1.03 


32 


0.013 - 


-0.010 


0.082 


0.079 


1.08 


Lognormal 


18 


-0.011 - 


-0.018 


0.138 


0.140 


0.96 


32 


0.015 


0.012 


0.122 


0.121 


1.02 


CIR Normal 


18 
32 


-0.006 - 
-0.001 - 


-0.015 
-0.007 


0.111 
0.091 


0.111 
0.092 


0.99 
0.98 


Gamma 


18 


0.001 - 


-0.005 


0.104 


0.106 


0.96 


32 


0.007 


0.002 


0.081 


0.084 


0.93 



The dominant effect is starting-point bias, which gets worse with smaller s. The auto- 
detect estimator (bottom left) mitigates the effect rather well. Note that the bias contours 
of vi (top left) and wi (top right) are almost identical: as we found in Section 13. 2[ for 
SU&D the only shift between these two estimators would be due to skew, and here there is 
no skew. 

The starting-point effect even spills over to CIR (bottom right): with small s and n 
and large offset, most of the information about F is gathered somewhere between xi and 
Qp] since away from target F does have a substantial curvature, the resulting linear inverse 
interpolation is therefore systematically biased towards xi. 

If bias was the only "game in town" (and normal the only "distribution in town" ) , one 
would be led to believe that coarse spacing and CIR are the way to go. Fig. 13.81 maps the 
other component of estimation error - standard deviation - in the same units used for bias. 
Note that here the y-axes show absolute offset w.r.t. xi, because of the symmetry of both 
SU&D and F. Color-coding here still follows "green is good", with the SD becoming yellow, 
then brown as it increases. The same color-coding is used for MSE contours in subsequent 
figures. 
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The first observation is that for most design conditions, estimation SD dominates bias 
- for normal thresholds, by a factor of about 10 (compare SD magnitudes with Fig. I3.7p . 
Next, we see that as s increases, so does the SD. Finally, note that the estimators have a 
clear ranking w.r.t. SD - vi tightest, followed by vad, wi and CIR, whose SD is larger than 
vi's by about 10 - 20%. 

Next, we look at overall estimation MSE (Fig. 13. 9|) . In the figure, the MSE is normalized 
by the asymptotic, direct-sampling, threshold percentile estimation error (jl.ip . given in the 
introduction. Note that when all stars are aligned correctly - no starting-point offset, fine 
spacing, tight estimator - we can sometimes observe what seems like "superefficiency" : an 
error smaller than the asymptotic direct-sampling expression. This happens especially with 
averaging estimators, through which U&D converts percentile-finding into a potentially 
more efficient averaging process. 

Since the SD dominates estimation, the less variable vi or vad provide better overall 
performance. The former holds an edge for coarse spacing (s > a), where starting-point 
bias is small. On the other hand, the auto-detect estimator, since it mitigates starting-point 
bias while retaining a relatively small SD, offers the best overall combination of performance 
and robustness around s ~ cj/2 - a factor of 2 smaller than the optimal spacing usually 
quoted in literature. CIR's large SD bars it from being competitive under this favorable 
normal-thresholds scenario. 

SU&D, Exponential Thresholds 

What happens when the distribution is highly asymmetric? I chose the exponential with 
location parameter ^ and scale parameter a (equal to the threshold SD). The estimation 
SD plot has been omitted, since the patterns observed in Fig. 13.81 are largely distribution- 
independent. The bias magnitude is larger now (Fig. I3.10p . Averaging-estimator bias is 
positive almost across the board, since the second-order skew term is positive; it also in- 
creases with increasing s. Starting-point bias still takes its toll on the finest-spacing designs, 
but now also - albeit more mildly - on coarser designs0 The auto-detect scheme can do 
nothing about the stationary skew-related bias, and therefore vad^s bias map shows vertical 
iso-bias contours. 

CIR, too, has a positive spacing-dependent bias - because F is concave (when F's linear 
interpolation runs below the true curve, the bias is positive and vice versa). However, CIR's 
bias is certainly smaller in magnitude. 

In spite of larger biases, the magnitude of estimation SD is still typically about twice 



The latter effect is due to the interaction between starting-point and /'s skewness and restricted range. 
Unhke the simpler starting-point bias, this one can only be positive under exponential thresholds. 
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as large. The best combination of robustness and performance seems to be offered by vad 
(Fig. 13. m lower left), in the s a/4 to a/2 range - closely followed by CIR. Note that 
the "sweet spot" of quasi-superefficient estimation has moved to the lower left-hand corner, 
where the negative starting-point bias nearly cancels out the positive stationary biases. 

KR, Normal Thresholds 

We move on to non-median targets. The data shown here are for KR, k = 3. We look at 
bias and MSE w.r.t. to (5o.2 and not the exact formal target of (5o.20635 since there is little 
applied interest in the latter (the difference is quite small in any case) . CIR estimates were, 
indeed, performed for Q0.2 - giving that estimator a slight a priori edge. 

The starting-point biases make their appearances again, and are greater in magnitude 
than for SU&D - which is expected since KR converges more slowly (Fig. I3.12p . The other 
dominant feature for vi or vad are vertical iso-bias contours. This is the effect of /iTr's first- 
order stationary bias. As found in Section 13.21 this bias is negative and linear in s. CIR 
shows a negative spacing-dependent bias as well, because F is locally convex near (5o.2- The 
reversal-average wi shows very little of this effect; it appears that with normal thresholds, 
//tt's bias is almost perfectly neutralized by reversal-averaging's positive bias. 

Looking at estimation performance, the picture is now more complex (Fig. I3.13p . Be- 
cause of the stationary bias, vad is less attractive. Its performance deteriorates quickly as 
s increases; around s ~ f7/2 it is matched or beaten by CIR. Moreover, at somewhat larger 
spacing, as the starting-point bias diminishes and stationary biases kick in, reversal-average 
wi suddenly appears like a viable option. 

However, while vad^s and CIR's advantages are grounded in their theoretical robustness, 
the reversal-averaging surprise is more of a numerical coincidence related to a particular form 
of F, as we shall soon see. 

KR, Exponential Thresholds 

Using exponential thresholds, the bias picture changes yet again (Fig. 13.1^ . Now, the first- 
order spacing-dependent stationary bias has all but disappeared (it is still somewhat visible 
in the auto-detect map for n = 40). This bias seems to be neutralized by /'s upward skew, 
which would of course affect the telescopic sum (j3.7|) . This allows the starting-point bias to 
dominate averaging estimation again. 

As far as estimation goes, now wi is by far the worst among the four, vad and CIR 
are nearly equivalent with fine spacing (s « cr/4 to (t/2). As s increases, there is a narrow 
window in which vi appears to overtake them, before the first-order downward bias finally 
kicks in and hurts all three averaging estimators. 
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So for KR, normal and exponential thresholds produce almost the opposite outcomes, 
with different bias sources neutralizing each other for different estimators, depending upon 
the form of F. Note that overall, relative performance with KR is somewhat worse than 
with SU&D (when normalized by the asymptotic direct-sampling efficiency) . 

The same qualitative patterns were observed for /c = 2 simulation results as well, but 
have been omitted for brevity. 



3.4-3 Interval Coverage 

Table 13.71 compares the three interval-estimation options for CIR discussed earlier in the 
chapter, using data from SU&D runs. Even though the bootstrap was never used for 
inverse estimation in the U&D and IR context ( at least not in publish ed literature), it 
has been recommended for forward estimation by IStylianou et al.l (120031 ). They used the 
bias-corrected bootstrap; here, after some experimentation with various bootstrap variants, 
I opted for bootstrap-^, wh ich is considered relatively robust and with a rate-n accuracy 
(jPavison and Hinklevl . flQQsl . Ch. 5). 

The bootstrap results are quite disappointing (the same level of coverage was observed 
with other bootstrap variants, but not shown here). A possible explanation is that the 
bootstrap process fails to reproduce the hierarchical nature of variability (in sampling, and 
then in F values). On the other hand, the approach developed in Section [3^ using linearized 
quantile functions and allowing for the added variability in allocation, seems to be quite 
conservative with satisfactory overall coverage. As explained earlier, the binomial option 
fails to distinguish well between quantiles on the tail; the Poisson option does the job better 
while sacrificing little in coverage. Therefore, subsequently I have used the linearized- Poisson 
approach for CIR intervals. 

Table [3T8l survevs coverage performance across a wider variety of designs, for both CIR 
and the averaging estimator vad- For the latter, the hitting-time approach outlined in 
Section [331 was used (see equation (|3.28p ). Overall, coverage is good, tending to be on 
the conservative side. Even in cases when the average coverage appears to dip below the 
nominal level, it is usually due to a single difficult scenario out of 6 — 8 used for calculations; 
for all other scenarios coverage was close to, or above, the nominal level. 

The interval estimators have their vulnerabilities: CIR coverage suffers as F becomes 
shallower, since then the F-s would show many monotonicity violations, resulting in fewer 
points in CIR's output, points whose location would vary from run to run0 If F around 
target is very shallow indeed, the coverage method would break down; this seems to occur 



Using IR would hardly alleviate this: with IR, the output would have wide flat intervals, with inverse 
estimation highly variable as a result. 
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somewhere below AF ~ 0.05 between levels. 

Averaging-estimator CI's break down most conspicuously when the estimator's bias 
(whether due to starting-point or to boundary effects) is too large. Cases when target lies 
on the boundary are not reflected in the averaging-estimator coverage statistics reported 
here. On the other hand, CIR coverage is quite resistant to boundaries. 
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Table 3.7: SU&D interval coverage for CIR estimates under four distribution scenarios, and 
n = 15,30. In all scenarios m = 10, with relatively fine spacing (s/a < 0.5). Shown are 
ensemble coverage percentages for 90% (top) and 95% (bottom) CPs, based on ensembles of 
2000 runs per scenario. The bootstrap simulations used 999 bootstrap repetitions per run. 



Coverage of 90% Confidence Intervals: CIR for SUfcD 











Distribution Scenario Code 




Interval Method 






"gam5090" 


"gaml090" 


"log0280" "norm3330" 


Bootstrap-t 


n 


= 15 


70.6% 


70.4% 


71.2% 


73.6% 


n 


= 30 


79.3% 


78.7% 


80.1% 


78.9% 


Linecirized Binomial 


n 


= 15 


96.7% 


92.1% 


95.2% 


94.8% 


Quantiles 


n 


= 30 


97.2% 


94.4% 


95.5% 


95.3% 


Linecirized Poisson 


n 


= 15 


92.1% 


90.1% 


92.1% 


91.6% 


Quantiles 


n 


= 30 


96.8% 


93.9% 


95.3% 


95.1% 


Coverage of 9 


•5% Confidence Intervals: 


CIR for SU&D 










Distribution Scenario Code 




Interval Method 






"gam5090" 


"gamlOQO" 


"log0280" ' 


'norm3330" 


Bootstrap-t 


n 


= 15 


78.9% 


77.7% 


77.9% 


79.4% 


n 


= 30 


85.5% 


84.7% 


85.8% 


86.1% 


Linecirized Binomial 


n 


= 15 


96.9% 


93.1% 


96.2% 


95.9% 


Quantiles 


n 


= 30 


97.9% 


95.4% 


97.1% 


96.6% 


Linearized Poisson 


n 


= 15 


95.8% 


93.4% 


95.8% 


95.3% 


Quantiles 


n 


= 30 


98.4% 


96.4% 


97.5% 


97.2% 
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Table 3.8: Interval Coverage Summary for various designs and estimators. Shown are em- 
pirical coverage statistics of 90% (top) and 95% (bottom) CI's. For non-median designs, the 
targets of Q0.2 and Qo.s were used, rather than each design's actual target. All summaries 
are of 6 distribution scenarios each, with an ensemble size = 2000 and a relatively fine 
spacing (me// ~ 10), except for the GU&D summary (second row in each table), which is 
from 8 scenarios, N = 1000 and a fixed-boundary m = 6. 



90% Confidence Intervals 



Design and Estimator 






Mean 


S.D. 


Min. 


Max. 


KR {k = 3) and CIR 


n = 


20 


89.2% 


2.6% 


86.3 


% 


93.1% 


n = 


40 


92.1% 


3.2% 


86.5% 


95.5% 


GU&;D(2 0,1) and CIR 


n = 


18 


94.7% 


3.3% 


89.6 


% 


99.5% 


n = 


32 


95.7% 


1.6% 


93.8 


% 


98.5% 


KR {k = 3) and vad 


n = 


20 


94.8% 


3.0% 


89.2 


% 


97.1% 


n = 


40 


95.8% 


2.7% 


92.2 


% 


99.6% 


SU&D and vad 


n = 


15 


91.8% 


4.1% 


84.9 


% 


97.1% 


n = 


30 


96.6% 


3.7% 


90.0 


% 


99.7% 



95% Confidence Intervals 



Design 


and Estimator 






Mean 


S.D. 


Min. 


Max. 


KR {k 


= 3) and CIR 


n 


= 20 


93.3% 


2.2% 


90.4% 


95.8% 


n 


= 40 


95.4% 


2.4% 


91.4% 


97.8% 


GU&D 


(2,0,1) and CIR 


n 


= 18 


97.1% 


1.8% 


94.8% 


100.0% 


n 


= 32 


97.5% 


1.3% 


95.5% 


99.4% 


KR {k 


= 3) and vad 


n 


= 20 


95.5% 


2.6% 


90.7% 


97.9% 


n 


= 40 


97.9% 


1.4% 


96.4% 


99.9% 



SU&D and vad 



n = 15 94.7% 2.3% 90.6% 97.5% 
n = 30 98.2% 2.0% 94.7% 99.9% 
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SU&lD, Normal Thresholds, Empirical Bias, n = 30 
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Figure 3.7: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and normalized starting-point offset (xi — Qp)/(T (vertical axis). Contours are of SU&D 
ensemble bias normalized by cr, for n = 30. Contours are color-coded: blue indicates 
negative bias, yellow positive bias, and green indicates near-zero bias. In each page, four 
estimators are shown: vi (top left), wi (top right), vad (bottom left) and CIR (bottom 
right). The threshold distribution is normal. Each contour plot uses 5 by 5 empirical bias 
statistics, each of which was calculated from 2000 runs. 



82 



SU^D, Normal Thresholds, Empirical SD, n = 30 



All Trials from Reversal 1 Reversals from Reversal 1 
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Figure 3.8: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to absolute offset \Qp — xi\/a (vertical axis). Contours are of SU&D ensemble standard 
deviation normalized by a, for n = 30. Contours are color-coded: as the SD increases, 
color changes from green through yellow to brown. All other details are as in Fig. 13. 7i 



SU<S^D, Normal Thresholds, Empirical MSE, n = 30 
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Figure 3.9: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to absolute offset \Qp — xi\/a (vertical axis). Contours are of SU&D ensemble MSE, 
normalized by the asymptotic percentile-estimation error (11.11) . for n = 30. Color-coding is 
as in Fig. 13. 8i All other details are as in Fig. 13. 7[ 



84 



SU^D, Exponential Thresholds, Empirical Bias, n = 30 
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Figure 3.10: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset {xi — Qp)/cr (vertical axis). Contours are of SU&D ensemble bias, normalized 
by the threshold SD, for n = 30. The threshold distribution is exponential. All other details 
are as in Fig. 13.7^ except that here 4 by 5 points under ly the contours. 
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SU^D, Exponential Thresholds, Empirical MSE, n = 30 
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Figure 3.11: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset {xi — Qp)/a (vertical axis). Contours are of SU&D ensemble MSE, normalized 
by the asymptotic percentile-estimation error (jl.ip . for n = 30. The threshold distribution 
is exponential. Color-coding is as in Fig. 13. 8i All other details are as in Fig. 13.101 
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KR {k = 3), Normal Thresholds, Empirical Bias, n = 40 
All Trials from Reversal 1 Reversals from Reversal 1 
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Figure 3.12: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset (xi — Qp)/cr (vertical axis). Contours are of KR (A; = 3) ensemble bias w.r.t. 
Qo.2, normalized by the threshold SD, for n = 40. The threshold distribution is normal. All 
other details are as in Fig. 13.71 
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KR [k = 3), Normal Thresholds, Empirical MSE, n = 40 
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Figure 3.13: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset (xi — Qp)/cr (vertical axis). Contours are of KR {k = 3) ensemble MSE 
w.r.t. Qo.2t, normalized by the asymptotic percentile-estimation error (11. ip . for n = 40. 
The threshold distribution is normal. Color-coding is as in Fig. 13.81 All other details are as 
in Fig. [321 



KR {k = 3), Exponential Thresholds, Empirical Bias, n = 40 
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Figure 3.14: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset [xi — Qp)/u (vertical axis). Contours are of KR (A; = 3) ensemble bias w.r.t. 
Qo.2) normalized by the threshold SD, for n = 40. The threshold distribution is exponential. 
All other details are as in Fig. 13. 7i 
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KR {k = 3), Exponential Thresholds, Empirical MSE, n = 40 
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Figure 3.15: Sensitivity of selected estimators to normalized spacing s/a (horizontal axis) 
and to offset (xi — Qp)/(T (vertical axis). Contours are of KR (k = 3) ensemble MSE 
w.r.t. Qo.2, normalized by the asymptotic percentile-estimation error (jl.ip . for n = 40. 
The threshold distribution is exponential. Color-coding is as in Fig. 13.81 The green bands 
showing on the top-left corners of the vi and wi maps are artifacts, as the MSE there is so 
high that the contours need to cycle through the entire spectrum again (w)i's normalized 
MSE reaches as high as 13). All other details are as in Fig. 13.71 
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3.5 The Anesthesiology Experiment 



As explained in the introductio n, this study has been motivated by a consulting collabora- 
tion on a real experiment plan (|Qronl . |200J) . The experiment has event ually been carried 



out f rom late 2004 to early 2007, and initial results are being presented ([Bhananker et al 



20071 ). Dr. M.J. Souter has kindly allowed me to present the data here as well. This will 
help illustrate the theoretical features described so far, and to demonstrate the inevitable 
gap between theory and practice. 



3. 5. 1 Background 



Propofol is the current anesthetic agent of choice for routine surgical procedures in the 
industrialized world. It has replaced previous agents such as thiopental, because it has no 
adverse effects on blood pressure like other agents, it enables faster patient recovery from 
drowsiness after the procedure, and for other reasons. However, propofol has a very unpleas- 
ant - and paradoxical - side effect: in the minutes before passing out, many patients feel 
intense pain. Several studi es and meta-analyses estim ate the overall population frequency 
of pain responses at 70% (jPicard and Trameii . l2000l ). Devising methods to alleviate this 
side effect is an ongoing practical and academic endeavor. 

One interesting approach has been to mix propofol with thiopental - an older agent 
with less favorable overall outcomes, but without the pain side-effect. Several studies have 
shown that the primary therapeutic (anesthe sizing) effect for s uch mixtures is additive - 
meaning that efficacy is not adversely affected (IVinik et al.l . Il999l ) - and that the prevalence 
of pain responses is reduced (jPollard et al.l . 120021 '). However, to date no study had yielded a 
clear, definitive recommendation for mixture proportions to the general patient population. 
Being familiar with UfcD, which is used ext e nsively in ane sthesiology for median-finding 



(jCapogna et al 



2001 



Camorcia et al 



2004 : 



Fisher 



20071 ). Dr. Souter approached our 



Consulting Service to inquire whether U&D methodology can be used to find a propo- 
fol/thiopental mix that reduces pain response frequency to under 20%. 



3.5.2 Design Constraints and Initial Recommendations 

Since propofol is used in routine surgery, and since the proposed treatment mixes it with 
another FDA-approved agent, this experiment was relatively free of the typical ethical and 
sample-size constraints encountered in medical research§ In practice, any adult (within 
certain health and age boundaries) assigned to surgery, who agreed to participate in the 



Of course, an appropriate IRB was submitted and approved. 
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experiment could be recruited. On the other hand, this walk-in recruiting suggested that 
indeed, a sequential design such as U&D would work better than a non-sequential one. 

Our initial recommendations to Dr. Souter were based upon a literature search and 
numerical simulations. The search has turned up both KR and BCD. KR exhibited superior 
performance across the board in simulated runs, in line with the few numerical comparisons 
appearing in literature. The runs also indicated that for an estimation with CI's of about 
±10%, sample size shou l d be quite larger than those discussed in the papers presenting BCD 



(iDurham and Flournoyl . Il995l : iDurham et al.l . 1 19951 ) - closer to 100 than to the samples of 
15 — 35 simulated there. Additionally, the convergence-stationarity tradeoff with respect to 
spacing was identified. I recommended k = A (targeting Qo.isg); narrowly over A; = 3, both 
because of somewhat better simulation results and to make sure the target is below Qq,2- I 
also recommended a spacing of 10% and a starting point somewhere near the middle of the 
range between and 100% propofol. 



3.5.3 Raw Results 

The experiment went underway in fall 2004 at Harborview Hospital in Seattle, with a KR 
(k = 4) design, starting point of 50 : 50, and 10% spacing. Treatment assignments were 
double-blind, and there was a single designated observer to evaluate whether pain response 
had occurred. The first stage was characterized by very low incidence of pain responses. 
After 58 patients, and after several rounds of consultation, the pain evaluator has changed 
and the design was switched to /c = 3, continuing from the point at which the experiment 
was at that time (80% propofol), until trial 90 when the experiment ended. Trials 59 — 90 
will be referred to as Stage 2. Some sample demographics are summarized in Table [3M The 
two stages' chains are illustrated in Fig. 13. 16^ and the treatment-response summary tables 
are shown in Table 13.101 . 



3.5.4 Conceptual Interlude 

The experiment's rationale can be summarized as follows: we assume there is some gen- 
eral population of potential patients, whose sensitivity to pain given a certain propo- 
fol/ thiopental mix can be modeled via an overall threshold subdistribution J^, monotone in- 
creasing with the proportion of propofol, and reaching ~ 0.7 at a propofol-only treatment. 
We are looking for the 20th percentile of or more precisely, for an easy-to-administer 
mix just below that percentile. Since the experiment was performed at a single convenience- 
chosen hospital, the sampling population is not the general population. However, 70% pain 
at propofol-only treatment is a rather universally well-established figure, and Dr. Souter's 
prior experience at that hospital had been compatible with this frequency as well. Hence, 
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Table 3.9: Age and gender summary table for the propofol/thiopental experiment. 



Stage 1 (trials 1-58) Stage 2 (trials 59-90) 

Female 13 10 





Minimum 


20 


19 




First Quartile 


28 


25 


Age 


Median 


43 


36 


Mean 


41.8 


37.3 




Third Quartile 


51 


47 




Maximum 


81 


64 



Table 3.10: Treatment-Response summary table for the propofol/thiopental experiment. 



Propofol 


Stage 1 (trials 1-58) 

Pain No Pain F 


Stage 2 (trials 59-90) 
Pain No Pain F 


Pain 


Overall 

No Pain 


F 


50% 





4 














4 





60% 





8 








12 








4 





70% 


1 


16 


0.06 


4 


11 


0.27 


5 


27 


0.16 


80% 


3 


13 


0.19 


2 


3 


0.40 


5 


16 


0.24 


90% 


2 


6 


0.25 








2 


6 


0.25 


100% 


1 


4 


0.20 








1 


4 


0.20 


Total 


7 


51 


0.12 


6 


26 


0.19 


13 


77 


0.14 
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Propofol/Thiopental Experiment, Stage 1 




10 20 30 40 50 60 

Trial 



Propofol/Thiopental Experiment, Stage 2 



o 




60 65 70 75 80 85 90 

Trial 



Figur e 3.16: Chains of the propofol/thiopental experiment reported by (iBhananker et alJ . 
20071 ^. Stage 1 (trials 1 - 58) is on top and Stage 2 (trials 58 — 90) on bottom. Responses 
are marked as in Fig. 12.11 
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we had no reason to expect sharp deviations from the general-population baseline. 

Pain evaluation presents another difficulty. It is not measured via a standard instrument, 
but evaluated by a subjective observer; thus T actually describes re-ported pain outcomes 
from the population of potential patient- evaluator pairs. Therefore, from a statistical point 
of view we should ideally sample (or randomize) several evaluators simultaneously with 
the sampling of patients. However, the experiment design was constrained to a single, 
specifically-trained evaluator, with a scheduled change of evaluators at a certain time point. 
The two stages are in effect samples from two different threshold-evaluator sub-populations, 
with subdistribution functions F^^^ and F^"^^ . 

To sum it up, the experiment does not sample the general population of pain thresholds 
and pain evaluators. Fortunately, the goal is not a purely academic exercise of studying J^, 
but the practical determination of a mix that might reasonably work for general samples out 
of J-. Our hope is that the samples from F^^^ and F^"^^ would show a behavior sufficiently 
compatible with what we know about to make our percentile estimate practically useful 
for the general population. 

3.5.5 Analysis and Estimation 

The results from Stage 1 (the first 58 trials) raise concerns regarding this compatibility. 
In particular, beginning with subject 11 there were 20 consecutive trials with no pain re- 
sponse, culminating in 4 no-pain trials at propofol-only treatment before finally observing 
pain on the fifth. Given our knowledge of J- we'd expect a pain probability of 0.7 for that 
treatment. For a geometric r.v. with success probability 0.7, there is less than 0.01 chance 
of observing 4 or more failures before the first success. A more comprehensive sensitiv- 
ity analysis incorporating all 58 Stage 1 trials within a Bayesian framework was carried 
out, using various functional forms for F^^\ and a prior distribution calibrated to pre- 
dict F(^)(100% propofol) 0.7. The results strongly suggest that Stage 1 data are not 
compatible with the population baseline (Fig. 13.171 top). In fact, posterior medians for 
F(^)(100% propofol) are around 0.3. 

Later upon closer examination, it was found that Stage I's patients included subjects 
aged 76 and 81, who should definitely have been excluded from the experiment due to old 
age, and 3 more patients 65 years or older whose inclusion was marginal. None of them 
exhibited pain0 Since the design was sequential, this affected allocation to all subsequent 
patients, hence distorting all averaging estimators. Since Stage 2 had no patients older than 



The common assumption is that the nervous system loses its sensitivity with age, and hence pain 
incidence would decrease. The oldest patient out of all 90 trials to exhibit a pain response was 57 years 
old. 
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Figure 3.17: Histograms of posterior pain frequency at a propofol-only treatment, based 
on Stage 1 (top) and Stage 2 (bottom) data, and using location-scale logistic (left), shape- 
scale Weibull (center) and 3-parameter logistic (right) Bayesian models. The thick vertical 
lines mark pain probabilities of 0.6 and 0.7. The posterior probabilities for pain frequency 
exceeding 60% based on Stage 1 data were 0.008,0.002 and 0.002 for the three models, 
respectively. The analogous probabilities based on Stage 2 data were 0.36,0.17 and 0.24. 
Each posterior distribution was approximated via MCMC, using 400,000 cycles thinned by 
a factor of 40. 
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64, this presents yet another difference between F^^^ and F^'^\ 

Stage 2 exhibited a rather different response pattern from Stage 1, a pattern reasonably 
compatible with the population baseline - though still a bit on the low side (Fig. 13.171 
bottom) @ Therefore, we decided to estimate Q0.2 for the general population using only 
Stage 2 data. 

This estimation is quite straightforward: since F^'^^ is monotone, the IR and CIR es- 
timates are identical - yielding a point estimate of 67.5% propofol for 20% pain response, 
with a 95% CI of (55.9%, 79.1%). The auto-detect averaging estimator vad identifies the 
stage's third trial as the averaging starting point, while the first and third reversals occur 
on the second and fourth trial, respectively, vad and the reversal-cutoff estimates vi,V3 
are 67.33%, 67.42% and 67.24% propofol, respectively - very close to the IR/CIR estimate. 
The 95% CI for vad is (51.8%, 82.8%) Considering convenience of application, my clinical 
collaborators therefore decided to recommend a 2 parts (66.7%) propofol, 1 part thiopental 
mix for reducing pain response frequency to around 20%. 



^ A label-permutation test showed some evidence for different evaluation sensitivities between the two 
evaluators, even after removing the two oldest patients (p — 0.04). Ifowever, the evidence for this is 
confounded by the differences in age and gender composition between the two stages (cf. Table |3]9}. 

^^It seems that the IR/CIR interval estimate is a bit too optimistic, because of the steep slope of F between 
60% and 70% propofol. 
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3.6 Optimizing Up-and-Down: A Practical Summary 

Chapters 2-3 have traversed a wide range of U&D topics. It is time to tie back the loose ends. 
The optimization recommendations below are split into "Clearly Indicated" - meaning 
that solid theory and/or across-the-board numerical results support the conclusion, and 
"Recommended" ~ meaning that my work suggests this direction, but it is not as clear- 
cut. Within each category, conclusions are arranged by subtopics. 



3.6.1 Clearly Indicated 

Method and Sample Size 

1. Designs with super-small n < 10 should be avoided. Even though SU&D is fast- 
converging, that convergence is probabilistic and does not apply deterministically to 
each individual run. Under the recommended design decisions (see below), SU&D 
convergence would typically take up (on the average) at least half a dozen trials, and 
at least twice that for other U&D variants. 

From the theoretical and numerical results presented here, a reasonably minimalist 
sample-size range would be 16 — 32 for SU&D, and around 8 — 16 times {k + 1) for 
KR designsH 



2. For non-median targets, KR is clearly recommended over BCD, unless the target 
percentile cannot be approximated by a KR design. 

Design: Spacing, Boundaries, Stages 

1. Boundaries should be avoided. In case there is a real physical boundary (e.g., at zero), 
a logarithmic (or log-ratio) treatment scale should be used if feasible. If boundaries 
are unavoidable, then either use a scheme that halves the spacing given some trigger, 
and/or estimate with the more robust CIR. 



2. The recommendation, dominant since 



Dixon and MoodI (jl948l ). to set s « cr, is wrong. 



The indicated spacing magnitude is about half that, or even a bit smaller. Conceptu- 
ally, this is because U&D's stationary advantages do not kick in unless vr has a clearly 
defined, sharp peak. For SU&D with s ~ cr, rn^ff. - the number of levels where F is 
substantially different from or 1 - is about 4 — 6, too little to guarantee such a peak. 
Technically, the spacing effect manifests itself in estimation variance and bias which 



The reason for the {k + 1) factor is that to move 'up' and back 'down' requires at least A: + 1 trials. 
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both increase with s. The main adverse effect of smaller s is larger starting-point 
biases - but there are solutions that mitigate much of this effect. 

Overall, with knowledge of a one should set a spacing somewhat smaller than cr/2, 
and yet a bit smaller for non-median targets. Without such knowledge, aim for JTig//. 
to be somewhere between 8 and 12 (for non-median target, one should estimate me//, 
as the number of levels where we expect F to be solidly within the interval (0, 2p)). 
If one prefers the AF approach, that is, set s w.r.t to the expected increment of F 
between levels around target, then the recommendations are equivalent to the range 
AF ~ 0.1 to 0.15 (the currently recommended spacing translates to AF ~ 0.3). 

3. The optimal starting point minimizes the expected number of trials to target. Whether 
design boundaries exist or not, visualize a "realistic range": the range in which you 
realistically expect the target might be. The starting point should then be the treat- 
ment from which the chain would take the same number of steps to either end. For 
example, with k = 2, xi should be 2/3 of the way up that range (see Section [3.51 for 
my mistaken placement of xi in the anesthesiology experiment, which cost about a 
dozen trials). 

Estimation 

1. Centered Isotonic Regression (CIR) is clearly preferred over IR, unless one knows 
that F resembles a stair-step function. Even though the bias reduction was proven 
only for an approximate limit case, the logic underlying the correction is generic, and 
numerical results point to a 20 — 30% or even greater improvement in efficiency under 
most scenarios. 

2. BCD and GU&D should be used with the CIR estimator and not with averaging 
estimators - the former because of its slow convergence, and the latter because of the 
graininess of cohort-based chains. 

3. For SU&D and KR, the reversal-average w should be abandoned in favor of v-type 
estimators averaging all treatments from a cutoff point. They have smaller variance, 
and one bias source less to worry about. 

4. The first reversal is too early for such a cutoff point, as it is far from guarantee- 
ing removal of the starting-point effect. Alternatives will be discussed below under 
"Recommended" . 
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3.6.2 Recommended 

Method and Sample Size 

• Regarding cohort-based group U&D (GU&D) designs: designs with = (r^ being 
the "remain at the same level" probability) at all levels should inherently converge 
faster. Conversely, the stationary profile becomes steeper (more "peaked") as r„ in- 
creases. The GU&;D(fc 0,1) design, examined here numerically, seems to be a reasonable 
candidate for Phase I trials (with A; = 2 or 3). Note the comment above regarding 
GU&D estimation. 



The question of whether better U&D met hods can be found or whethe r an optimal 
one can be formulated - briefly treated by iBortot and Giovagnolil (j2005l ) for second- 
order methods using a BCD-type randomization scheme - remains open. However, 
it appears that the combination of simplicity, performance and usage track record 
offered by SU&D and KR, would be hard to beat. 



Design: Spacing, Boundaries, Stages 



The "layover" boundary conditions suggested bv iGarcia-Perea (| 19981 ) to mitigate the 
boundary effect, do not appear to be advantageous on the whole. Whatever bias 
reduction they offer in some scenarios, is offset by opposite bias in scenarios where 
the underlying assumption (that F remains constant beyond the boundary) does not 
hold - and by increased variance. The milder "imputation" fix suggested here suffers 
from the same problems. Again, the best way to deal with a boundary is to avoid 
having it, to half the spacing upon brushing up against it, and to estimate using a 
boundary-resistant estimator such as CIR. 



Estimation, Stopping Rules, Multistage Schemes 

• Instead of the first reversal, one should set the cutoff point for averaging estimators 
either at the auto-detect point developed here, or possibly at the third reversal. The 
AD point has the advantage of being adaptive: in case the starting-point effect is small, 
the AD point will happen early - while the other alternatives would lose efficiency. 
However, AD should be used with care: the AD point is not to be allowed to be 
greater than n/3, or than the number of trials required to reach each end of the 
"realistic range" (described above) ~ leaving room for one "switchback" along the 
way. For example, for SU&D if that range is 5 levels from xi, then the AD point 
should not be later than 5-1-2 steps later, i.e, no later than xg. Clearly, if n is too 



small to allow for meeting the latter criterion, then it should probably increased (or 
the spacing made coarser). 

For KR designs, one may wish to explore incorporating the AD estimator over the zero- 
state sub-chain only - an estimator which gets rid of the first-order stationary bias, 
and introduces a second-order bias in the opposite direction (perhaps some average of 
VAD and VAD,o can be used). 

The AD point could also be used for a stopping rule, by mandating that the experiment 
continue until AD analysis yields h trials used for averaging. 

For interval estimation, the bootstrap based on point estimates of F does not appear 
to capture the magnitude of U&D's inverse-estimation variability. The more direct 
methods developed here are suggested instead. 

Regarding multistage schemes: here, too, using the first reversal as a transition point 
seems too risky. Either the third reversal, or the third or fourth hitting-time at the 
most frequently visited level are more robust alternatives. Since these events typically 
occur only a dozen or more trials into the experiment, one should probably refrain 
from multistage schemes unless n is large enough to a allow for a substantial part of 
the experiment to take place after the transition point. 
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Additional Glossary for Chapter 3 

Note: some terms appearing only in the sequence of proofs leading up to Theorem [6] are 
omitted here. They are of no interest beyond that particular result, where they are ex- 
plained. 

CIR: Centered Isotonic Regression. 
IR: Isotonic Regression. 
NPMLE: The nonparametric MLE. 

PAVA: Pooled Adjacent Violators Algorithm. The algorithm used to produce IR esti- 
mates. 



Bc-d- An average of "i?" values, from Be to B^- "i?" here is a dummy stand-in for any 
value which is indexed over levels, trials, etc., such as F, x, and so on. Also, the 
average may be weighted if applicable. 

c(n): The location (as index in the treatment chain) of the averaging cutoff point iden- 
tified by the auto-detect estimation method. 

F^: The binomial point estimate of F at obtained simply by calculating the pro- 
portion of positive responses at lu accumulated until the point in time at which the 
estimate is taken. 

F: The linear interpolation of F between design points. 

The general meta-population (or hyper-population) of potential reported pain out- 
comes, out of the overall population of patients and pain evaluators. A concept used 
to interpret the meaning of the anesthesiology experiment in Section 13.51 

C: The likelihood. 

M: The length of a monotonicity-violating interval identified by the PAVA or CIR 
algorithms. 
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TLeff: The effective sample size: the number of hypothetical i.i.d. observations needed 
to achieve the same rate of variance reduction for the sample mean as that observed. 

Vad- Arithmetic average of all treatments, beginning with the "auto-detect" point; a 
suggested estimator of Qp. 

Vf Arithmetic average of all treatments, beginning with the j-th reversal; a possible 
estimator of Qp. 

Wf Arithmetic average of treatments at reversal points only, beginning with the j-th 
reversal; a popular estimator of Qp. 



Wj : Sim ilar to wj, except that the average is somewhat modified. This was 



Wetherill et al 



(|l966l )'s original reversal estimator (now less commonly used than Wj] 



Zu'. A value used to denote the running estimate of Fu during the PAVA and CIR 
algorithms. Its initial value is equal to F^. 



fij^: The mean of the stationary distribution. 

Pj'. The length (in trials) of the interval between the j-th and j + 1-th hitting times 
at a certain Markov-chain state. 

fP: The treatment chain's j-th order autocorrelation coefficient. If j is omitted, p 
refers to the first-order coefficient (note the difference from the use of p in Chapter 2). 

a: The standard deviation of thresholds. 

cTtt: The standard deviation of vr. 

Uu'. The weight given to a point estimate at 1^ (usually proportional to in the 
context of PAVA/CIR). 
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Chapter 4 

BAYESIAN PERCENTILE-FINDING DESIGNS 
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4.1 Overview 



4.1.1 History 

There is evidence for earlier sources, but credit for the first massive and practical introduc- 
tion of Ba yesian designs to perceii t ile-fin dinp; goes to Watson and Pelli in psychophysics, 



circa 1980 ( Watson and Pelh 



197S 



19831 ). Their algorithm, nicknamed QUEST, is widely 
used (though perhaps not as widely as U&D-descended designs) and even implemented 
into automated experiment control software. However, the current statistical debate is 
dominated by QUEST'S younger relative CRM (Continual Reassessment Method) - a de- 
sign using the same pr inciples, developed independently vis-a-vis the Phase I application 



(|Q'Quiglev et al.1 . [1993)0 

[n that latter paper, the motivation and principles were clearly 



described, and CRM's advantage over both the '3+3' Phase I design and U&D was forcefully 
argued. The main argument is that using only the outcomes of a single cohort to determine 
allocation is inherently inferior to model-based allocation using information from all trials. 

CRM's essential features have not changed much since 1990. However, some modifica- 
tions have been suggested to cope with what appears t o be CRM's g reatest obstacle: its 



perception as risky by the medical community (see, e.g., iPalmeij (|2002l )). This is not only a 



[ate doses too quickly ( 


Korn et al. 


(Goodman et al. 


1995) - has been 



by some leading biostatisticians (jMathew et al 



2004 



Pisters et al. 



2004 ) . Similar concerns 



have surfaced with respect to QUEST in psych ophysics (jAlcala-Quintana and Garcia-Perea . 
20041 : iGarcia- Perez and Alcala-Quintanal . l2005l ) . Recent CRM designs seem to focus on more 
sophistic ated solutions to this problem. One of them, Escalation with Overdose Control 
(EWOC,|Babb_elalJ,ll998|) will be discus sed in the next s ubsection. Both CRM and EWOC 



are available online as software packages (|Xu et al.l . 120071 ) . An article titled "CRM tutorial" 
for general audiences, including discussion of its modifications, has been recently published 
(jGarrett-Maveij . 120061 ) . Two nonparametric or "model- free" approaches to CRM have also 
been suggested - one of them still using t he Bayesian framewo r k wit h prior and posterior 
distributions, but with a saturated model (jGasparini and Eiseld . I2OOOI ) . a nd one where allo- 
cation decisions are based solely on a nonparametric local estimate of F (jLeung and Wana . 



2001 



Yuan and Chappell 



2004 



Ivanova et al 



20071 . see below in Section I4.1.3P . 



The current popularity of Bayesian approaches in the statistical community notwith- 
standing, the relative merit of Bayesian percentile- finding designs is hard to assess, espe- 
cially for Phase I trials. Phase I clinical trials is not an application that easily accommo- 



^ Interestingly, O'Quigley et al. did not cite QUEST, and were probably not aware of its existence. 
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dates repeated or large-scale comparative experiments. To date, the only clear theoretical 
result regarding CRM properties, is that one-parameter models con verge to optimal allo- 
cation (to be defined later) under rather restrictive conditions on F (jShen and Q'Quiglevl . 



1996 



Cheung and Chappell . l2002l ). In most other CRM publications, claims for a given de- 



sign's advantage are usually supported by logical arguments or by simulation comparisons 
with competing methods. A recent mini-review co-authored by one of CRM's developers 
(jO'Quiglev and Zoharl . |2006| ) even claims that numerical simulations are the recommended 
approach to compare percentile- finding designs. 

Unfortunately, simulations are prone to 'rigging', whether intentional or unintentional. 
The most common 'rigging' in favor of parametric designs is to choose scenarios in which 
F is correct l y spe cified (or very closely approximated) by the author's parametric model 



(Babbetal 



19981 ). Another fallacy is using the wrong estimator for the 'competitor' meth- 
ods. This is almost always the case whenever U&D appears in CRM publications - not 
intentionally, but out of ignorance amon g statisticia.ns reg arding common U&D usage. For 
example, the paper introducing EWOC (|Babb et al.l . Il998l ) claims its advantage over a host 
of competitors, including stochastic approximation, CRM, the '3-1-3' protocol, GU&D(3 0,2) 
and KR with k = 2. However, all simulation scenarios were generated via the location-scale 
logistic model used by the authors' EWOC, and U&D 'estimation' was defined as the dose 
allocation following the last trial. By contrast, a recent comparative simulation study in 
psychophysics, performed by researchers who contributed to both U&D and Bayesian de- 
signs and who are acquainted with 'best practices' in both, has yielded the conclusion that 
QUEST is the worst of the five desig ns tested, especially as far as robustness (or in their 
terminology, 'usability') is concerned (jCarcia- Perez and Alcala-Quintanal . l2005l ) . 

In general, the safe assumption (lacking a clear and verified scientific model for F) is 
that any parametric model used is misspecified. Therefore, when examining the properties 
of CRM or other parametric designs, our attention will focus on what happens when the 
model is misspecified. 



4--1-2 Description of Common Bayesian Designs 
Definition and Terminology 

Definition 5 Let a CRM scheme he a generic Up-and-Down design (see{l\ (i)), also 
having: 

1. A response model G{x,9), which is a (sub) distribution function on x for any 9 £ Q, 
and a strictly monotoneo continuous algebraic function of 9, with continuous first and 



■^Shorthand for: strictly monotone in each dj holding x and the other parameters constant. 
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second partial derivatives everywhere in Q (hereafter: the CRM modelj. The family 
of possible curves for a given model will be referred to as G. 

2. A prior distribution Il{6\(f)) for the model's parameters^ 

3. A two-stage decision rule for allocating the next treatment Xj+i, using the posterior 
distribution of 6, li (Qp|xi, . . . Xi,yi, . . . yi, (p)). 



4- A stopping rule. 

The monotonicity in each parameter is needed for model identifiability. The allocation 
rule will be further defined and refined in the next section. There has been some work 



regarding CRM stopping rules (e.g. IZohar and Chevretl . l2003l ) : however, here I assume for 



simplicity that the stopping rule is "stop after the fixed-size allocated sample is exhausted" 
- which is still the most commonly used rule. 

Unlike U&D, Bayesian designs almost always assume a finite set of treatment levels. 

Allocation rules 
quest's original rule is 

Xi+i = lu,u = argmax (11 iQp\xi,. . . Xi, yi, . . . yi, (/>))) , (4.1) 

with the maximum taken over design points only. That is, allocation goes to the (discrete) 



posterior mode over design points o The original CRM rule is 

Xi+i =l{^,u = argmin|/„ - E[Qp\xi,. . . Xi,yi, . . .yi,n(6')]| , (4.2) 
or in words: all o cate the next treatment to the level 'closest' to Qp's posterior mean 



(jO'Quiglev et al.l . 119901 ). 'Closest' may be measured either on the treatment scale or on 
the response scale. Technically, the estimate is calculated either by viewing Qp as an al- 
gebraic function of 9 and estimating 9, by calculating at G"s posterior estimates at design 
points, or directly by looking at Qp's posterior distribution. 



■^The lower-case tt, usually used to denote prior and posterior in Bayesian literature, has already been 
taken in this text for denoting stationary distributions. 

''Doubtlessly, this rule was chosen for computation reasons. It allowed calculation to proceed as a simple 
updating of the previous posterior, instead of tedious integrals (MCMC in its current form did not exist 
at the time) . So this was probably the only option enabling individual researchers to use QUEST for their 
experiments. 
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Fol lowing; concerns regard ing potential jumps by CRM to liigh-toxicity territory (iKorn et alJ . 



mi) 



Goodman et al 



(|l995l ) suggested modifying this rule to: move one level in direction 
of Qp's posterior mean (or no move, if the current level is the closest). EWOC's rule is even 
more conservative: after setting an acceptable risk level a, allocation is 



Xi+i = max{/^, : Ft {G{x = 1^,6) > Qp\xi, . . . Xi,yi, . . . yi) < a} 



(4.3) 



or in words: allocate the next treatment to the largest level wh ose posterior prob ability of 
exceeding Qp is not greater than the predetermined risk level ( Babb et al. . IQQsl ) R This 
allocation rule can be traced to the asymmetric loss function 



a{Qp - x) X < Qp 
(1 - a){x - Qp) X > Qp 



(4.4) 



EWOC's originators used a risk level of a = 0.25. Note that since G is monotone in x, the 
EWOC estimator allocates to the level just below the 100a% percentile of QpS posterior 
distribution. Therefore, using median instead of mean with the 'plain' CRM approach is 
basically equivalent to using EWOC with a = 0.5, showing that these two CRM variants 
are very closely related. 



Models and estimation 

QUEST was introduced with a 4-parameter Weibull model|§ By contrast, the original 
model used by O'Quigley et al. was a one-parameter "hyperbolic tangent" or "power" 
model, which is essentially a 3-parameter logistic model with location and scale fixed and 
a free shape parameter. Since then, O'Quigley and others have repeatedly argued in favor 
of a one-parameter CRM and against using more parameters. The reasoning is that since 
Qp is a single parameter, using more than 1 parameter in the model that tracks it down 
is dangerously redundant and less robust. A currently popular one-parameter CRM model 
is G{lu) = {G^^\lu)^^ , where the G^'^'^ values a t design points are d etermined according to 
prior knowledge, and ^ is a single parameter (jMathew et al.l . |2004| ). Clearly, the original 
"power" model is a special case of this family. Another variation is one-parameter scale 
logistic, with fixed location (|Garrett-Maveii . 120061 ). However, many other CRM researchers 
use the 2-parameter, location-scale logistic model, perhaps simply because it is the most 



^EWOC's originators actually assume a continuous treatment range, in which case the next dose can be 
allocated so that the risk is exactly a. 

®Two of these parameters are the 'false positive' and 'false negative' rates used in psychophysics, and two 



are the familiar shape and scale Weibull parameters. 
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commonly used model in medical dose-response applications, and can be generalized to 
logistic regression. EWOC developers recommend this model as well. 

Following the CRM experiment, estimation is a straightforward: most often, CRM 
estimation is simply the next allocation. This is in stark contrast to U&D, where 
allocation and estimation are quite separate. Sometime minor variations appear, such as 
using the MLE for estimation (which is equivalent to a uniform prior), but the principle 
remains the same. The practice of using allocation for estimation may seem self-evident, 
but it may also raise questions, especially with regards to CRM. CRM's developers still 
claim that a one-parameter model should be used, regardless of researchers' beliefs about 
the properties of the true F. This is understandable for allocation, but once the experiment 
is over one would think that the final estimate should be generated using the most realistic 
model, not the one deemed most effective for mid-experiment allocation. However, that 
point is not often debated nowadays. 



.1.3 Nonparametric CRM 



A radically different modeling approach was recently suggested: here the main argument is 
that misspecified mod els impose too many constr aints, preventing the CRM scheme from 
adapting to the data (jCasparini and Eiseld . l200d ) . The proposed solution is to use a sat- 
urated model: a model with m parameters, that can exactly match F on {Im}- This 
model has not been generally accepted in CRM circles, and its performance is not quite 
spectacular, even in the simulations presented by its originators. One counter-argument to 
saturated CR M was that satura. ted models, too, impose d constraints via the r t iodel' s prior 
- as claimed bv lO'QuiglevI (|2002l ) in a direct rejoinder to lCasparini and Eiseld (j200d ) - but 
that unlike parametric models, these constraints are less easily tractable. 



A truly nonparametric approach was suggested bv iLeung and Wand (|200l|). Here, in- 
stead of a Bayesian posterior estimate, the point estimates to be used at each stage are 
derived from isotonic regression. At each step, the next level is chosen from the current 
level and its two immediate neighbors, according to whose IR estimate is closest to p. This 
design can also be seen as a constrained CRM with a saturated model and no prior (i.e., a 
constrained MLE-based allocation). 

Two modifications to this design have since appeared, both of them relaxing the con- 
dition s somewhat to include an interval of F values that allows remaining at the same 
le vel. lYuan and Chappelll (|200J) suggest the interval [p, 2p], tailored to very low targets. 



Ivanova et al 



(|2007l ) suggest a more generic window: [p — Ai{p),p + A2{p)]. For conve- 
nience, the authors preferred to use Ai(p) = A2(p) = A(p). Recommended values for A(p) 
were determined by intense numerical scenario searches run on the tpm's of GU&D designs 
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with increasing cohort size (the authors view a decision at Z„ as a GU&D decision with a 
cohort size equal to n„). For 0.1 < p < 0.25, A(p) = 0.09; A(0.3) = 0.10 and A(0.5) = 0.13. 
This design has been dubbed Cumulative Cohort Design (CCD), and will be discussed 
later on. 
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4.2 Theoretical Study: Convergence 

As CRM researchers admit, since the application is a smah-sample one good asymptotic 
properties are no guarantee for actual performance. On the other hand, convergence proper- 
ties may help us understand important as pects of small-sample behav ior. The main existing 



result regarding CRM convergence, from lShen and O'QuiglevI (|l996l ). deals with misspeci- 
fied one-parameter models. These models can guarantee to match the true F at one point0 
the hop e is that the CRM s c heme ensures this point will be the design point closest to 
{Qp,p)- IShen and O'Quiglev were able to prove convergence to this point almost 



surely as n — > oo, only by assuming that the family of G curves be "close" enough to F, 
so that no matter which va lue of F is matched by G, the optimal level is still allocated. 



Cheung and Chappell (|2002l ) illustrate this restriction, noting that it is quite unrealistic for 
one-parameter models; essentially, it means that the model is only "mildly misspecified." 
They conjecture that the condition may be relaxed, but do not present a proof. As far as I 
know, no convergence result exists for multi-parameter models. 

During my work I have attempted to find more general CRM convergence proofs, as a 
tool to better understand the properties of this design approach. Following is a summary of 
my main results, beginning with one-parameter models. For the remainder of this section, I 
assume that in all cases, regardless of the allocation method, there is only a single optimal 
level (i.e., we rule out ties). Additionally, I assume that the design is constrained CRM (i.e., 
CRM limited to transitions of ±1 level), rather than the original unconstrained desig: 

4-2.1 Preamble 



[(le 
n§ 



First, I find that CRM's input data are better described by running proportions and point 
estimates than by the treatment and response chains. 

Lemma 4 The statistics (^{pm} , {Fm}^ , with {pm} defined via 

Tt 

Pu = —, (4.5) 

n 

and {Fm} defined in eq. \3.S^) in Chapter\^ are sufficient for 9 in the CRM likelihood. 

The proof follows easily from the factorization theorem. 
Next, we look more closely at the CRM allocation rule. 

^It is possible that more points would match, but this would be a lucky coincidence, not a matching 
guaranteed by the model. 

*It can be proven that the two become equivalent as n increases, and therefore the proofs hold for both; 
but this thesis has enough text in it already, and unconstrained CRM is not recommended in any case. 
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Definition 6 The CRM allocation rule is composed of two stages: 
1. A point estimate 9, generically defined as 

e = A,({p^},{Em}\G,4>). (4.6) 



6 must be a continuous function of {j5m},{i^m}? with continuous first-order 
partial derivatives. 

This estimate is then used to generate G = G{9) and Qp = G~^{p). Or, these statistics 
can be estimated directly from their posterior distributions, in such a way that they 
obey the same smoothness requirements w.r.t. {pm}-,{Fm}- 

2. A distance-based allocation Xn+i = A2{G). Common options include 

(a) "Closest treatment"; argmin^ ~ Qp i 

(b) "Closest response"; argmin^ G^—p ; 

(c) "Just under"; max|^u : < Qp|- 

Part 1 of this definition ensures that the CRM estimate is a smooth function of the 
sufficient statistics. If one takes the posterior mean, median or (continuous) mode of 
n {o\{Prn\ ^ {Fm} 1 to estimate 6, this smoothness condition should hold§| 

Finally, let us define a key property of parametric models in general and of CRM in 
particular. 

Definition 7 •A CRM model G{x, 9) will be said to have d degrees of freedom 

(d.f.), if for any e > 0, any set of d indices {ui < . . . < Ud) € {1 . . . m} and any set 
of corresponding real numbers < ai < . . . < a^^ < 1, there exists a domain V £ Q, 
containing a d-dimensional open rectangle, such that \Guj — aj| < e yO £ T>,j £ 

i...d. Also, u{9) > E e. 



• A CRM model will be called economical if d is equal to the number of parameters. 

• A CRM model will be called misspecified if there exists e > 0, such that there is no 
9 £ Q for which |G„ — Fu\ < e Vn € 1 . . .m, where F is the true response-threshold 
CDF. Otherwise, the model will be called correctly-specified. 



^If for some reason it does not, then probably that specific CRM scheme is inadequate for use. 
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In words, a model with d d.f. can always fit up to d values of F arbitrarily closely at 
any given time. All CRM models discussed here are assumed to be economical, i.e. that 
d equals the number of parameters. Typical standard parametric models usually answer 
this requirement. For example, the 1-parameter shape logistic model known as "power" 



(|Q'Quiglev et al.l . 119901 ) has d = 1, a. standard location-scale logistic or normal model has 
d = 2, and a shape-scale Weibull model has d = 2 if responses are known to be positive!^ 
Clearly, a CRM model with d = m d.f. cannot be misspecified. The case d = m is of course 
a saturated model. 

4-2.2 Convergence Proofs for Misspecified One-Parameter Models 
Note that the posterior distribution 



n {6\n, {pm}, {Fm}, (/'j oc / (xi, ...X \e)ii{e\<p) (4.7) 

can be written as 

Yl{e\n,{pm}AFra}A) = (xi , . . . x„, yi , . . . y„; 0) n(e | </>), (4.8) 

where C, a shorthand for the marginal distribution of the data, is not a function of 6. 
As n — > oo, the prior's relative weight diminishes, and the posterior's form converges to 
that of the likelihood for all parameters about which our information continues to increase. 
Therefore, Bayesian estimators - whether the mean, mode or a given posterior percentile 
- will converge to the MLE. Moreover, this means that asymptotically it does not matter 
whether the statistics G, Qp are estimated directly or via 6. With one parameter this 



assurnption always holds, and it has been used by IShen and O'QuiglevI (jl996l ) in their 
pr oofr^l Similarly, my inspecti on of CRM asymptotics will be likelihood-based. 



Shen and O'QuiglevI (|l99d ) have shown that correctly specified and "mildly misspeci- 
fied" CRM schemes with d = \ converge almost surely to the optimal allocation. Instead of 
constraining F, I now constrain the treatment chain. 

Lemma 5 Suppose that a CRM experiment has a limiting distribution vr = (vri . . . vTm)' (in 
the sense that as n —> oo, ^ vr^ Vu G {1 .. . m} in probability; of course, X^„vr„ = !)■ 
Then at most 2 elements of vr are nonzero, and the corresponding levels must be adjacent to 
each other. 

^'^Note that i n literature one may encounter non-economical model choices - for example, the 1-parameter 
scale logistic (|0'Quiglev and Chevretill99lh . for which a careless choice of /io, the fixed location parameter, 
can prevent the model from fitting all design levels as Qp. Such a model will have d = d.f. 

^^I will discuss the multi-parameter case more carefully later on. 
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Proof Let us first assume allocation according to the "Closest treatment" or "Just under" 
rules. We use the properties of G, Qp outlined in Definitions I5|6i Since both are continuously 
(partially) differentiable in {pm}-, {-Pm}, as the former tend to a limit so will they: G —> G, 
and in particular Qp — > Qp, in probability. Therefore, for every e > there is an no(e) 
such that for all n > uq, Pr (^Qp G [Qp — s/3, Qp + s/3]j > 1 — e. But all points in this 
interval imply an assignment by stage A2 in Definition [H] either a single level or to one of two 
adjacent levels. Thus, at most 2 levels have a nonvanishing probability of being assigned. 
As to the "Closest response" rule, the conditions on G imply that G ^ G as well, and 
therefore allocation would behave in a similar manner. □ 

The proof's rationale is illustrated in Fig. 14. 1[ Note that the lemma says nothing about 
whether the method converges correctly. 
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Figure 4.1: Illustration of the proof in Lemma [5l From the lemma's conditions and from 
generic CRM model smoothness properties, the model curve G converges to a limit curve G, 
with a limit target-estimate Qp (which is generally not the true target). This estimate must 
lie between two design points, depicted here as lu,lu+i- Due to the curve's convergence, 
the probability of Qp straying far enough from Qp to allocate other levels beside these two 
(the distance s/3 for "Closest treatment" allocation is used in the proof and shown here), 
is diminishing. 

Lemma Os proof emphasizes the importance of checking the model G's properties. If we 
are not assured that outputs (estimates) are smooth in the inputs (raw data), as specified 
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in Definition [H we cannot guarantee convergence. This is because the maximal-likelihood 
set may include a positive-measure region in 0. If the allocation's first step hops arbitrarily 
within that region, the allocation itself may end up spread out across several levels. For- 
tunately, a posterior-based decision process (posterior mean, percentile or mode) should 
usually fulfill the necessary requirements. 

Another consequence of Lemma [S] is that for every e > there is a sufficiently large n, 
such that the score equations can be rewritten as 



dl 



ndOj 



ddj G.a{l-G^) dOj G^+l(l-G^,+l) 



< e,j = l...(i, 
(4.9) 

assuming of course that the two nonzero elements are tt^,, 7r„_|^i. 

Now, recall that for a given d = 1 curve family only a single curve can (via the 
likelihood equation) be guaranteed to pass through a specified point (x,F(x)). Let this 
curve be called the CRM curve that matches F at x. 

Theorem 8 For a CRM scheme with d= \, assume a limiting distribution vr exists. Then: 
(i) If TT has exactly two nonzero elements indexed u,u + l, and if all CRM curves G 
matching F in lu+i] obey Gu+i — Gu < — F^ (i.e., the model is "shallower" than 
the true curve in the vicinity of \luilu+i]) , then lu,lu+i must be the two levels around target. 

(a) Under the "Closest response" or "Just under" allocation rules, if tt has a single 
nonzero element indexed u, and if the model is "shallower" than the true curve in the 
vicinities of and [lu,lu+i], (assuming w.l.o.g. 1 < u < m), then the scheme 

converges in probability to optimal allocation, i.e. allocation only to the closest level 
to target. 

Proof (i) Since there are two elements, the solution to the score equations (j4.9p tends to a 
limit curve G obeying 

Gu+i — Fu+i = —K [Gu — Fu 



for some positive constant K. Clearly, G must match F somewhere in {lu^lu+i)- Since all 
G curves answering this criterion are "shallower" than F in the vicinity of this segment, we 
have Gu > Fu and Gu+i < -^u+i- Now assume by contradiction and w.l.o.g. that Fu > p and 
u > 1, so that lu,lu+i are not the design points around target. Then, obviously, p < Fu < 
Gu < Gu+i- The probability of assigning lu+i diminishes, reaching a contradiction. □ 

(ii) The proof here is for the "Closest response" rule, but it can be easily tailored for 
the "Just under" rule as well. Since tTu is the dominant element, G will match F exactly 
at lu- Now, assume by contradiction that this is not the optimal level, and further assume 
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w.l.o.g. that lu-i is closer to target. This means that p < Gu- Since Gu-i G Fu-i-,Guj ■, 
Gu-i is closer to target than G„. lu-i becomes a superior assignment under G, reaching a 
contradiction. □ 

The proof's rationale for part (i) is illustrated on Fig. 14.21 
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Figure 4.2: Illustration of the proof in Theorem [8l part (i). We assume by contradiction 
that levels lu, lu+i each receive an non-diminishing fraction of the allocations, and that both 
are on the same side of target. Since d = 1, the model's limit curve G cannot match the 
true curve F at both design points, and instead matches it somewhere between them. Since 
G is "shallower" than F in this region, both G values are on the same side of target as well, 
ensuring the gradual elimination of allocations to lu+i - yielding a contradiction. 



4-2.3 Convergence Proofs for Multi- Parameter Models 

First, we must note that the likelihood-based approach to analyze CRM convergence is not 
automatically correct with d > 1. We also need that G{x) will be a nontrivial function of 
all parameters at all design points. I will assume this is the case (though it may not be true 
for some saturated model forms), and continue using the likelihood-based approach. Then 
Lemma Els result can be further strengthened with 2 or more d.f. 
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Lemma 6 Suppose that a CRM experiment has a limiting distribution vr = (vri . . . TTm)'. 
Then for d> 2 d.f., only a single element of tt is positive (i.e., it is equal to 1 and the rest 
are zero). 

Proof Following Lemma [5l let us assume that the only nonzero elements are ir^jTr^^i. For 
every e > and sufficiently large n, the score equations can be approximated by (|4.9p . The 
MLE is found by setting the score to zero. Since there are d > 2 such equations, the MLE 
converges to the degenerate solution G„ = = This means that as n ^ oo, 

— > Fy,Gy^i — > Fy^i. Since we have ruled out ties in the introduction to this section, 
one of these level is closer to the true target Qp. Therefore, for sufficiently large n, this level 
would dominate. □ 

The proof's rationale is illustrated in Fig. 14.31 Note that here, too, the lemma says 
nothing about whether the method converges correctly. 

We are now ready to prove convergence for misspecified-model CRM using the limiting- 
distribution approach, under certain conditions - regardless of how misspecified the model 
is. 

Theorem 9 Suppose that a CRM experiment has 

1. A limiting distribution vr = (tti . . . T^m)' ; 

2. d > 2 degrees of freedom; 

3. The "Closest response" or "Just under" allocation rule; and 

4. If T^u > 0, then lu-i and l^+i (when applicable) also belong to S, defined as 

S = {lu : Uu ^ 00 as n —> co} . (4-10) 

Then tt^* = 1, with u* being the index of the level closest to target. In other words, the 
CRM scheme converges to optimal allocation. 

Proof Again by contradiction. 

First, note that a level belonging to S does not automatically have a positive stationary 
frequency; all we know is that it is allocated infinitely often. In fact, due to Lemma [6l we 
know that only a single level, say 1^, has vr^ > (in fact, 7r„ = 1). 

Now assume w.l.o.g. that v > u* , meaning that > p. Since 



Uv » max (nt,_i, nt,+i) 
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(the relation a S> 6 is introduced here as a shorthand for 6/a — > as a — > oo), we can 
show using similar reasonings to that used above in Lemma [6l that F^. Therefore, 

Pr(G^ < p) — > and Pr(Gt, > p) ^ 1. Now, ly^i can only be allocated if Gy < p, and 
can only be allocated if Gv > p- Hence, not only ^ n^-i (given), but also n^-i ^ n^+i- 
So if any contradiction can arise, it can do so only from allocations to Iv-i- We focus our 
attention there. 

Since d > 2, G can match F in at least two points ~ the first being of course Z^. 
Since n„_i ^ n^+i, matching at Iv-i (rather than at Iv+i) will yield higher likelihood. 
At the same time, since — > oo (given), -F^_i -Ftj-i- Since it is given that l^-i is 



Gt,-i -p 



< 



Gv-p 



h 



closer to target, then (again similarly to LemmaEfs proof) Pr 
meaning that CRM allocations to l^^i will eventually dominate allocations to 1^ - producing 
a contradiction. □ 



The rationale for this proof, as well, can be understood via Fig. 14. 3i 

Why does Theorem [9] not apply to the "Closest treatment" allocation rule? Because 
then the allocation is based on modeling F between design points, where we have no direct 
information. A misspecified model would in general model the curve incorrectly. Even when 
G = at the two design points around Qp, the misspecification error between them can be 
large enough to perceive the second-closest level as closest. However, it can be shown that 
under the "Closest treatment" rule and assuming all other conditions hold, we can do no 
worse than second-closest. 

Lemma [6] and Theorem [9] hold regardless of whether the model is correctly specified. 
Can we do better assuming the model is correctly specified - as is the case with d = 17 

In this problem, the difference between misspecified and correctly-specified models boils 
down to the number of F values that can be simultaneously fitted - up to d for misspecified 
models, and up to m for correctly-specified models. For the latter, one would expect that 
the MLE's consistency - a standard textbook result - would guarantee CRM's optimal 
allocation convergence. However, the textbook r e sult re fers to i.i.d. sampling. Consistency 



under U&D sampling was proven by iTsutakawal (|l967al ) ; the key to the proof was the fact 



that allocation proportions converge to vr, similarly to the condition used in the misspecified- 
model proofs above. 

Our case is yet a bit more different. For U&D, every treatment level for which < < 1 
has a positive vr^j. For CRM, we have just proven that if vr exists then it has only 1 nonzero 
element if d > 1. This means that for d > 1 we cannot rely on the likelihood equations to 
provide a consistent MLE for 9, because there is no identifiability. With over one d.f. you 
can run an infinite number of curves through a single point. So at present. Theorem [9] is 
the most I can offer using the limiting-distribution approach, regardless of whether or not 
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Figure 4.3: Illustration of the proof in Lemma [6l Assuming by contradiction that levels 
lu,lu+i each receive an non-diminishing fraction of the allocations, and since d > 2, the 
model's limit curve G will match the true curve F at both design points. But then, since 
we have ruled out ties, one level will clearly emerge as closer to target (according to the 
allocation definition of "closer"), and from a certain point onward would dominate the 
allocations. Note that this would happen whether luJu+i are the two levels around target 
(as in the illustration) or not. 

This figure can also be used to understand Theorem [Dfs proof. Say that l^+i is CRM's limit 
level, but is closer to target. If 71^+2 are also unbounded, then » G will 

tend to match F at lu+i and lu, and ultimately lu (being closer to target) will supplant lu+i 
as the most frequently allocated level. 



the model is correctly specified. 



4-2.4 Nonparametric CRM-type Models 



Finall y, we turn to the nonparametric CRM-style designs mentioned earlier. iLeung and Wang 
([2001 )'s "isotonic regression" design can be seen as a special (prior-less) case of a d = m 



CRM scheme. Since there is no prior, the likelihood-based anlysis used here is applicable. 
However, as discussed above, the addition of more degrees of freedom beyond the second one 
does not appear to help guarantee convergence; in fact, it may hinder it due to premature 
over-fitti ng (see further below). Regarding the more sop histicated CCD scheme and similar 



designs (jlvanova et al.l . 120071 : lYuan and Chappelll . |2004| ). I have the following result. 
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Theorem 10 Schemes that repeatedly allocate levellu as long as F„ G [p — Ai{p),p + A2{p)], 
and move "up" or "down" if Fu is below or above this interval, respectively, will converge 
optimally if lu* is the only level whose true F value is in the said interval. 

Proof Assume the condition specified in the theorem holds. Then, any level lu that is 
allocated infinitely often will ultimately see F^ — > F^. If F^ lies outside the interval, 
allocation will eventually move away from - upward if F^ is below the interval, and 
vice versa. As n — > oo, this will happen for all levels except lu*- □ 



4-2. 5 Conceptual Interpretation of the Results 

The various proofs may leave worrying about conditions like "having an asymptotically 
dominant level, whose neighbors still receive allocations infinitely often" being too vague, 
not directly observable from the design or perhaps even circular. 

First, let us frame this in context. As far as I know, to date, a full generation since 
Bayesian percentile-fin ding schemes were first impl emented, there has been one (1) conver- 
gence proof published (|Shen and O'Quiglevl . 119961 ). with mu ch narrower applicabil i ty an d 
under conditions more restrictive than those postulated here (jCheung and Chappell . |2002| ) . 
Moreover, this proof - and the subject of CRM proofs in general - has not played a central 
role in the CRM debate The absence of theoretical results may indicate that they are 
not easy to come by, or that they require inconvenient conditions such as those I used. 

For example, the existence of vr was specified as a condition, though it perhaps can 
be proven in some cases. At face value, it may be hard to envision a treatment chain for 
this application that does not converge to some vr; however, quasi-periodical chains that 
gravitate between levels without stabilizing may be possible. In general the existence of 
vr is a random condition. The same goes for specifying that certain levels are allocated 
infinitely often. Whether or not a random condition holds, may depend upon an individual 
run's trajectory. 

In spite of using various approaches, I could not bring about a hermetic convergence 
proof for general CRM designs, regardless of the form of F, without resorting to random 
conditions. Without these conditions, a CRM experiment may find itself dug in a "hole" 
with very misguided point estimates surrounding the optimal level ~ point estimates which 
cannot be corrected over time with more information, since they themselves prevent the 



^•^ Acc ording to an August 14, 2007, ISI Web of Science search, 151 articles have cited lO'Quiglev et al.l 
ll 19901). CRM's "founding manifesto", since 2000. Over the same time period, only 23 articles cited 
IShen and O'Quiglevl l|l996l ). the paper presenting the above-menti oned proof - and 9 of these were co- 
authored by O'Qu igley. This compares, e.g., with 85 articles citing ICoodman et al.l l|l995h and 49 citing 
iBabb et all l| 19981 ') - two other key CRM articles published around the same time frame. 
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allocation of any trials to these levels. This means, that for many designs and distribution 
scenarios, some runs may converge and some not, depending upon their particular history. 
This is an important point, whose significance will be more fully understood in the next 
section. 

Meanwhile, one thing that can be learned from the proofs is CRM's operating principle. 
CRM takes advantage of two salient features of percentile- finding experiments: 

1. The pointwise convergence of {F^}, as mandated by the laws of large numbers. 

2. The problem's basic geometry in 2 dimensions, i.e. a continuous increasing F, a 
horizontal line corresponding to F = p which it must cross at some point, and the 
discrete grid - on which both decisions and observations must take place (c/. Fig. l4.ip . 

As seen above, CRM designs take advantage of the second (geometric) feature only if 
they have 2 or more d.f. This point has not been noticed so far by CRM theorists. 

Inverting the conditions spelled out in the definitions and the theorems, can help provide 
more details about what might stop CRM from converging: 

1. An allocation scheme whose posterior Qp is not smooth in the sufficient-statistic inputs 
(see Definition [S]) , or is not monotone in all parameters (see Definition ED cannot be 
guaranteed to converge. 

2. Non-economical models (see Definition [7]) should be avoided. 

3. One-parameter models seem better off with a "shallow" curve family. "Shallowness" 
improves the chain's mobility: the experiment is less likely to lock into a single level 
based on flimsy evidence. On the other hand, "shallow" models are more likely to 
end up oscillating between 2 levels rather than converging to a single level. The 
model's slope around target can be easily checked via sensitivity analysis before the 
experiment. 

4. Schemes are more likely to converge correctly when using the "Closest response" rule 
than when using the "Closest treatment" rule. This is because the former does not 
require correct specification between design points. To my knowledge, this has not 
been explicitly mentioned in CRM literature, though "Closest response" does appear 
to be more widely used. 

5. If the experiment locks into a single level early on and appears to never leave it, there 
is no guarantee that this is indeed the optimal level. This point can be inferred from 
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the conditions to most theorems and lemmas in the previous section, and will be 
discussed in more detail in the next section. 

It seems that for CRM convergence, misspecification and over-fitting are two sides of 
the same coin. With misguided point estimates, both a "steep" one-parameter model and 
a saturated model are prone to lock onto the wrong level, perhaps indefinitely. 

At bottom line, from a convergence point of view and with no knowledge of F''s true 
form, a flexible-shape 2 — 3 parameter model appears to be most suitable. However, there 
is no guarantee that this model type has an advantage for small n - where quick response 
may be more crucial than long-term fit, and therefore a one-parameter model may be more 
suitable. 

The nonparametric "isotonic regression" design is essentially a saturated CRM scheme, 
so it is prone to over-f itting and therefore not recommended. The modified CCD method 
of 



Ivanova et al. 



(120071 ). which allows a tolerance window, avoids some of the convergence 
pitfalls of saturated models because it forces transitions once the local point estimate is 
outside the window - regardless of the estimates at neighboring levels. However, it too is 
not guaranteed to converge optimally, unless level spacing is very coarse. 
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4.3 Theoretical Study: Small-Sample Behavior 

4.3.1 Conceptual Interlude 

Looking back at the two core features driving CRM's convergence (mentioned at the previous 
section's close), they seem simple and reliable enough. However, while feature 2 (geometry) 
always holds, feature 1 (point estimate precision) kicks in only when n becomes quite large. 
Before this happens, things can go wrong ~ most often via the "locking" phenomenon. 

More broadly, CRM is an estimation-based allocation scheme. At each point in 
the experiment, we choose to observe thresholds at the level most likely to be closest to 
target according to our model. This is in sharp contrast with U&D, where allocation is 
sampling-oriented: there, we quickly converge to random-walk sampling from vr, without 
insisting that each single trial be performed at the best level. Sampling proceeds regardless 
of where we believe the exact target lies. Through this sampling, U&D provides us with 
gradually better information about the location of vr's mean, and also about F values in the 
3 — 4 levels closest to it - enabling response-based estimation as well. 

CRM does not consider any given observation's value for estimation further down the 
road, but focuses on the here and now. This should be seen as a vulnerability, and is 
of course related to the "locking" phenomenon. Prom a common-sense perspective, we 
certainly should not presume that our model is good enough for meaningful estimation at, 
say, n = 5 ~ especially if we acknowledge that the model is misspecified. However, CRM 
is based precisely upon this presumption. Nonparametric CRM approaches, including 
CCD, share this basic presumption, and therefore also display CRM's core vulnerability. 

4.3.2 "Locking", "Gambling" and "Unlucky Batches" 

I introduce the notion of "lucky" and "unlucky batches" of thresholds. We can quan- 
tify these terms: a "lucky batch" is a sequentially-sampled group of, say, 5 — 10 thresholds, 
whose sample moments (most importantly the first two: mean and variance) are "close" to 
the population moments. Again, the meaning of "close" can be quantified: close enough so 
that, with a given spacing, the level closest to the batch's lOOp-th percentile is indeed the 
optimal levelP^I An "unlucky batch" is the opposite - a similarly sized sample whose mo- 
ments are far enough from population moments, to throw {Fm} off in a manner that causes 
CRM to allocate to the wrong level. "Unlucky batches" are to binary-response sampling 
what outliers are to ordinary direct sampling. 

Whether a batch is deemed "unlucky" in a particular experiment depends on the inter- 



Keep in mind, though, that thresholds are not directly observed. 
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play between the threshold variance Var{T), the level spacing s, the CRM model (including 
its prior specification) and also the "luck of the draw" - the particular sequence in which 
the batch is observed, and the treatment levels at which it is observed. The first two fac- 
tors can be summarized via AF, the difference in F values between adjacent levels around 
target. Once AF is fixed, some notion of how likely "unlucky batches" are can be gained 
by looking at Pr (^\Fu — F\ < AF^ at a single level Z„, for various values of p and AF, as a 
function of The higher the probability, the smaller the chance of an "unlucky batch". 
This is a simple, straightforward calculation of binomial probabilities. Fig. 14.41 shows this 
for AF = 0.2 (top) and 0.1 (bottom), and true values p = 0.2, 0.3 and 0.5. 

If AF is large - 0.2 or greater - the probability that a given batch would be "unlucky" 
decreases to about 20% by the time 5 for p = 0.2 (Fig. 14.41 top). If the target is closer 
to the median, it takes more trials - around n^t ~ 10 - to reach the same risk level. What if 
the design spacing is finer, so that AF = 0.1? The risk of "unlucky batches" dramatically 
increases: it hovers around 50% for riu ~ 5 — 15, regardless of target. A risk of less than 20% 
is not attained until well after the typical sample size for CRM experiments is exhausted 
(Fig. 14.41 bottom). All in all, it is reasonable to expect that in an experiment of size 
20 — 40, at least one "unlucky batch" of 5 — 10 observations will occur. 

"Unlucky batches" can throw any design off. An U&D chain, too, will trend up when 
encountering a sequence of uncharacteristically high thresholds, and vice versa. However, 
its Markov-chain short memory becomes an asset: by the next hitting time at a level close 
to ta rget, the probabili stic effect of previous excursions upon future allocations becomes 



zero (jTsutakawal . 
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On the other hand, CRM is a long-mem ory method - a prope r ty th at its developers are 



all too happy to point out as an advantage (jO'Quiglev and Zohaij . l2006l ) . The long memory 



itself is not necessarily detrimental, but combined with CRM's estimation-oriented alloca- 
tion scheme, CRM turns into a something of a gambling endeavor. The effect of outliers on 
estimation can be devastating for small n, but diminishes as n increases. Similarly, the ef- 
fect of an "unlucky batch" on CRM's estimation-allocation depends mostly upon when 
it occurs in the experiment. If the "unlucky batch" happens late enough, after point 
estimates have stabilized reasonably close to their true values, especially around target - 
the CRM chain will be quite resistant to subsequent "unlucky batches" . On the other hand, 
an early "unlucky batch" can create a "perfect storm". CRM would trend away from tar- 
get and lock onto the wrong levels, collecting subsequent information at the wrong place - 
thereby slowing down the self-correction process. Moreover, since the estimation-allocation 
uses all previous data, it will resist the correcting influence of more well-behaved batches 

^*Tlie effect on estimation is still retained, whether averaging or IR/CIR estimation is used. 
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Precision of Small-Sample Binomial Estimates 
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Precision of Small-Sample Binomial Estimates 
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Figure 4.4: Graphs of Probabilities of Pr ylF^ — p| < AFj as a function of n„, for AF 
(top) and 0.1 (bottom), and true F values of 0.2 (blue), 0.3 (red) and 0.5 (black). 
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following the "unlucky" one. 

To sum it up: CRM, with its long memory and estimation-based allocation, gambles 
that "unlucky batches" would occur late in the experiment - or not at all. If the gamble 
succeeds, we are quite likely to have a successful experiment. If it fails, we are almost 
assured of the opposite. 

This gambling property of CRM is demonstrated in Figs. l4.5l through l4.7[ In in Fig. 14.51 
we see the proportion of runs for which, out of the first 20 trials, at least 12 were allocated 
to the same single level (in symbols: the proportion of runs for which, at n = 20, pu > 0.6 
for some u). This is the proportion of runs for which CRM was quite "sure", during the 
early phase, that it "knew" where the target is. This proportion varies with target location 
(relative to the prior's weighting), and with distribution, but for one-parameter CRM (top) 
it remains at or above 60% of the runs. However, in about half of the cases, the gamble fails: 
the bulk of the first 20 trials were allocated to the wrong level (in all scenarios depicted, there 



is on ly a single optimal level). Also shown are data from a CCD experiment (jlvanova et al 



20071 ) , simulated over the same thresholds (Fig. 14.51 bottom) . The "gambling" tendency is 



somewhat subdued, but around 50% of runs, on average, are still dominated early on by a 
single level. The success rate is, if anything, even worse than CRM's. Note that for U&D 
designs, such an exclusive allocation pattern is all but impossible. 

In Fig. 14.61 we can see that the effect of early "gambling" lingers on. Shown are his- 
tograms of Pu* after 40 trials. That is: after 40 trials, the cumulative proportion allocated 
to the correct optimal level is calculated separately for each run; the proportion from all 
runs are then tallied into a histogram. These histograms can be interpreted as descriptive 
measures of convergence uniformity. This is done for U&D and for 3 different CRM 
models. If the design converges uniformly, we expect the distribution of pu* to be tight. 
This indeed occurs for U&D designs (top left-hand corner of each group of 4 histograms). 
The peak is rather modest in its location: pu* < 0.4 in most runs - but on the other hand, 
Pu* rarely falls below 0.2. CRM designs over the same thresholds (regardless of specific 
model) show a much large variability. With an "easy" distribution (normal, top), the en- 
semble mean of pu* for all 3 models is larger than U&D's. However, a sizable proportion of 
runs have little or no allocation at lu* after 40 trials. With a less convenient distribution 
(gamma with shape parameter 1.2, bottom), the ensemble mean is approximately the same 
for U&D and CRM, but the proportion of poorly-allocated CRM runs shoots up, especially 
with the one-parameter model. 

It should be noted, that while the ensemble mean of pu* is often reported in CRM 
simulation studies as a measure of the design's success, to my knowledge no study has 
looked at Pu* 's variability between runs. 
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Figure \47l\ shows a similar analysis comparing GU&:D2,o,i and the nonparametric CCD 
with cohort the same cohort size and target. The patterns are similar, with CCD having 
much larger variability. I used the same cohort size to make comparisons easier. Similar 
analysis with CCD and cohort size 3 (the size recommended by its developers) showed the 
same pattern. This indicates, that CRM's allocation robustness problem is not primarily 
caused by the use of a model, but rather by the reliance upon blunt point estimates to drive 
allocations early in the experiment. 

4.3.3 Effect of Prior 

In the previous section discussing convergence, the prior was almost always neglected. How- 
ever, for small samples it plays a decisive role. The first few allocations are not only driven 
by point estimates, but also by the prior. From numerical studies, it is clear that CRM 
performs much better when the true target corresponds to a high-density region of the 
prior, and vice versa. Naturally, the more parameters to the model, the "heavier" the prior 
becomes and the more time it takes for its effect to wear off. Since ultimately we'd like the 
data to play the leading role, this is one place where one-parameter models have a distinct 
advantage. In this work I have not attempted to tinker with priors too much, but usually 
optimized them so that Qp has a broad unimodal distribution centered around the middle of 
the design range, and so that the boundary levels have a smallest prior probability, but not 
a negligible one. The existence of a peak in Qp's prior, even a broad one, was sufficient to 
show a clear performance gap between scenarios in which in coincided with the true targets, 
and scenarios in which it did not. 
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Additional Glossary for Chapters 4-5 



BUD: Bayesian Up-and-Down: a family of methods combining U&D and CRM alloca- 
tion principles. 



CCD: Cumulative Cohort Design, a method recently developed by 
I characterize it here as a nonparametric Bayesian design. 



Ivanova et al. 



(120071) 



C RM: C ontinual Reassessment Method, the Bayesian design developed by lO'Quigley et al 
(|l990l ). Due to its prominence in statistical circles, it is also used here as shorthand 
for any Bayesian percentile- finding design. 



Babb et al 



E WOC : Escalation With Overdose Control, the Bayesian design developed by[ 

(|l998l ) . Essentially it is a CRM variant that aims to reduce the incidence of visits to 
treatments above target. 



QUEST: The Bayesian design developed by 



Watson and Pelli 



(|l979l . ll983l 'l. Very sim- 



ilar to CRM, but predates it by a decade and is almost unknown outside of psy- 
chophysics. 



d: The number of degrees of freedom in the model (see Definition [Tj) . 

G: The Bayesian model curve for F. 

G: The limit curve of G as n ^ oo (in case it has a limit curve). 

Q: The space of possible model curves under a given specification. 

Pu- The empirical cumulative proportion of allocations to for a given sample size. 

q: The cardinality of the set S (see below). 

5": The set of all levels receiving an unbounded number of allocation in a given ex- 
periment. In general this set is random. 
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u*: The index of the level closest to target according to the metric used (a.k.a. the 
optimal level). 

a: A key parameter in EWOC: allocation is to the a posterior quantile of Qp. 

(3: A key parameter in BUD: if the U&D allocation falls outside the (quantile-based) 
100(1 — 2(3)% posterior credible interval for Qp, it can be overridden by the Bayesian 
allocation. 

Ai (p) , A2 (p) : Parameters used in nonparametric Bayesian designs 

9: The parameter vector used to construct the model G, to create a tolerance window 
of F values, for which the current allocation is retained. 

©: The parameter space of d. 

H: Prior and posterior distributions of 6. 

(j): The parameter vector in 0's prior distribution. 
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Early Gambling: CRM 1-Parameter 'Power' Model 
Snapshot after 20 Trials (6 treatment levels, cohort size 2) 
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Early Gambling: CCD (nonparameteric CRM) 
Snapshot after 20 Trials (6 treatment levels, cohort size 2) 
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Figure 4.5: Charts depicting the proportion of CRM-type runs, in which at least 12 of the 
first 20 trials were ahocated to the same level. This proportion is then split cases when 
this level was correct (optimal, light blue) and wrong (maroon). The experiment simulated 
was a cohort design with cohort size 2, p = 0.3, m = 6 and seven different distributions 
arranged according to target location. Shown are the one-parameter "power" (logistic with 
free shape) model of Q'Quiglev et al. ( 199d ). optimized with "shallow" slope (top); and 
the CCD method of I van ova et al. ( 20071 ). with their recommended window width A = 0.1 
(bottom) - over the same simulated thresholds. 
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U&D Uniformity, 40 Triais (N=500) 



CRIU 2-Par Logistic Model Uniformity, 40 Trials 

g 



□ 



Cumulative Within-Run Proportion on Optimal Levt Cumulative Within-Run Proportion on Optimal Levt 



CRM 'Power' Model Uniformity, 40 Triais 



CRM 2-Par Weibull Model Uniformity, 40 Triais 



0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Cumulative Within-Run Proportion on Optimal Levt Cumulative Within-Run Proportion on Optimal Levt 



U&D Uniformity, 40 Trials (N=500) 



CRM 2-Par Logistic Model Uniformity, 40 Triais 
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Figure 4.6: Uniformity of U&D and CRM allocations, shown via histograms of Pu* from 500 
runs, after 40 trials. Simulations were non-cohort (or cohort size 1), with normal (top) and 
gamma, shape 1.2 (bottom) thresholds. In both cases m = 8. Each frame shows histograms 
of U&D runs (top left-hand corner), vs. CRM with the one-parameter "power" model 
(bottom left-hand corner), two-parameter logistic (top right-hand corner) and shape-scale 
Weibull (bottom right-hand corner). 
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CCD Uniformity, Uniform Tliresholds 
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Figure 4.7: Uniformity of CCD (top) and GU&D (bottom) allocations, shown via histograms 
of j5„* from 1000 runs, after 32 trials. Cohort size is 2 in designs targeting Qo.Si with uniform 
(top- left frames), and gamma, shape 0.8 (bottom- left), normal (top-right) and lognormal 
(bottom-left) thresholds. In all cases m = 6 and xi = l2- The location of in the frames 
was li,l2,h and Z4, respectively. 
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Chapter 5 

COMBINED DESIGN APPROACHES 
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5.1 Existing Designs and Current Knowledge 



Designs combining U&D with Bayesian or other long-memory methods are a recent and 
increasingly popular area of research. These designs focus mostly on the Phase I appli- 
cation. They are fueled by the dissatisfaction with traditional '3-1-3' type designs among 
statisticians, and by the failure of purely Bayesian designs to win over many practitioners 
due to their perceived risk. Since the Phase I field is rife with creative statisticians, I am 
sure that newer design s will continue to be publis hed as this thesis goes to print. 
A few researchers (jStoreii . 



2001 



Potted . I2OO2I ) have suggested the simple "two-stage" 



template familiar from U&D designs: start the experiment as U&D, then after the first re- 
versal change to a CRM or other model-based design. The two cited studies suggest starting 
with a median-targeting SU&D, then switch to a design that targets a lower percentile. 
An interesting combinat ion has originated from one of the few statistical groups studying 



U&D (Ivanova et al 



20031 ). They suggested a novel U&D variant that uses information 



from all previous trials in transition decisions. Attributed to an idea from an unpublished 
1950's thesis by Narayana, this method (dubbed here NM for "Narayana Method" ) requires 
the experiment to fulfill two conditions in order to move 'up' or 'down' at trial i: 1. Some 
U&D transition rule (e.g., BCD, KR, etc.), and 2. F{xi) < p for 'up' transitions, and vice 
versa for 'down' transitions. If only one condition is fulfilled, the experiment remains at the 
same level, i.e. Xi^i = Xj. NM's condition 2 is essentially nonparametric CRM, as discussed 
in Chapter HI Asymptotically, NM was proven to converge to a 2 — 3 level Markov chain 

around target (3 levels corresponds to the case when Qp is very close to a desig n level). 

NM appears to have been abandoned, but may have helped inspire the ideas of llvanova et al 



(j2007l ). Even though the latters' recent CCD method is presented as a fusion of U&D and 
CRM principles, it is in fact a nonparametric CRM, with U&D affecting it only indirectly 
in the determination of the window width A(p) (see Section [4. 1.3p . 



5.2 Theoretical Discussion 



At this point in the dissertation, the problems with the two-stage approach cited above 
are apparent. The first reversal happens too early in the experiment to guarantee reliable 
model-based estimation-allocation. Moreover, both authors chose to use median-targeting 
SU&D for the first stage, even though the experiment's ultimate target is lower. Even 
though this may seem like a clever way to offset the first reversal's typical downward bias 
(assuming the experiment starts low as is customary in Phase I), it means that a sizable 
proportion of these two-stage experiments would end up with a transition point close to the 
median. This is highly undesirable, especially for the toxicity-averse Phase I application. A 
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safer two-stage hybrid scheme would begin with an U&D whose target is closer to Qp, and 
then transition at reversal 3 or 5, or at a hitting time (see Section 3.6). 

Similarly, by now we may suspect what could go wrong with NM. First, being a CRM 
rule it suffers from the "locking" and "gambling" tendencies described in Section 14.3.11 
Moreover, even among CRM variants the saturated nonparametric model seems to have 
inferior convergence properties (see Section [4. 2. 5|) . since it shares little information between 
point estimates. Finally, the use of two criteria, both of which must be satisfied for a 
transition to occur, further am plifies the CRM-relat ed "locking" tendency. Therefore, even 



though the asymptotic proof in 



Ivanova et al 



is valid once the chain actually reaches 



the vicinity of target, it does not preclude the experiment being held up away from target 
for an indefinite amount of time. These problems have e i nerge d in some numerical trials I 
performed (data not shown; see more discussion in lOronI (|2005l )). 
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5.3 New Methodologies and Optimization 

5.3.1 Conceptual Overview 

Before proceeding to new combined Markovian-Bayesian designs, it may be helpful to list 
each approach's pros and cons. Using the insight gained by prior researchers and the added 
results reported in this thesis, the picture is clearer now. The table below compares the 
properties of "best in class" implementations of each approach. 

The main conclusion from Table [5T] is that there is probably no magic solution to the 
small-sample percentile-finding problem. In particular, the Achilles heel of both approaches 
is allocation. Even though U&D is listed as superior to CRM on most line items under "al- 
location", it is only a relative advantage. Run-to-run variability cannot be eliminated given 
the problem's nature. On the other hand, post-experiment estimation is not as problematic 
as it may appear. For U&D, the pros and cons of averaging estimators vs. CIR were clearly 
outlined in Chapter [3l and both are satisfactory if used "as directed" . Moreover, there is 
nothing preventing us from choosing model-based estimation after an U&D experiment, if 
we trust the model. Similarly, while CRM's "natural" estimator is the next allocation, we 
may supersede it with a more sophisticated model if warranted - or with CIR if we trust 
no model. 

Returning to allocation, both approaches are vulnerable to outlying "unlucky batches" 
and to variations in the form of F (though U&D is in general less sensitive) . Fortunately, to 
a large degree the two approaches are complementary. U&D is relatively more robust early 
on, and converges faster to its stationary behavior. As to CRM, its allocation performance 
improves with time. As n increases, it actually becomes more robust than U&D. 

As discussed in the previous section, current proposals to combine U&D and Bayesian 
designs fail to alleviate the worst small-sample risks. Instead of these ideas, I propose a 
gradual transition scheme, which is applicable regardless of sample size. Two options are 
described below. 

5.3.2 Randomized Transition 

Since CRM relies upon point estimates, its model's performance is indexed to n. The 
following randomization scheme utilizes this property: 
Randomized Bayesian Up-and-Down (R-BUD): 

• Start with a 5 - 10 trial U&D 'burn-in'; 

• For each subsequent trial, apply CRM allocation with probability n/ (n+no); otherwise 
apply U&D allocation. 
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Table 5.1: Side-by-side comparison of Markovian and Bayesian design properties. 
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The factor n can be replaced by ^/n, or any other increasing function (preferably one 
that has theoretical justification). Another optional safety precaution is to mandate that 
the first several allocations at each level will be U&D. 

Even though individual allocations are random, the R-BUD design actually guarantees 
transition to CRM-dominated allocations at a predetermined pace. This pace is controlled 
by the tuning parameter no- In view of results presented earlier in the thesis, good values 
for no are in the range 10 — 30. 



5.3.3 Inference-Based Transition 
Description 

R-BUD provides a reasonable combination template. However, it does not adapt to the data. 
In some cases, a "lucky" early run or a successful CRM model may provide opportunity for 
earlier transition; in other cases the opposite happens. 

At least to some extent, CRM itself can provide testimony about how well it is doing. 
This is sketched in Algorithm 3 below. 



Algorithm 3 Credible Bayesian Up-and-Down (C-BUD) 



procedure C-B\]D{xi,{lm},cutoS,n) 



while i < cutoff do 

(U&cD) 

i'^i + l 
end while 

while i < n do 

(U&cD) , (CRM) 

if xl^^ / xl^^ then 



Calculate Qp,/?, Qp^i-p ■ the CRM /3 and 1 — /? posterior quantiles of Qp 



, (CRM) 

else Xi+i 
end if 



then 



end if 
i^i + 1 
end while 
end procedure 



Notes: 
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1. The C-BUD scheme allows CRM to override U&D, only when a quantile-based 100(1 — 
2/3)% credible interval around the former is narrow enough to exclude a CRM alloca- 
tion identical to the U&D one0 As /? increases, a CRM override becomes easier. 

2. This allows the following easy interpretation: for /? > 0.5, C-BUD becomes CRM. For 
/3 = 0, C-BUD becomes U&D. Hence, it would seem that (3 = 0.25 is a roughly equal 
mix of the two designs. However, if the two allocations differ, then due to the discrete 
design CRM automatically begins with some positive baseline credibility level (on 
the order of 1/m). This is conceptually balanced by demanding the exclusion buffer 
around the 100(1 — 2/3)% credible interval. In practice, the range 0.15 — 0.25 for (3 
seems to reflect a relatively balanced combination. With these /? values we may expect 
around 10 — 15 trials to pass before observing the first Bayesian allocation override. 

3. An equivalent definition of the C-BUD rule is: take the tail closest to the U&D alloca- 
tion (including the U&D-allocated level itself), and calculate the posterior probability 
of Qp falling within that tail. If it is less than 100/3%, override the U&D allocation. 

4. Posterior-median CRM allocation fits naturally within this scheme and is strongly 
recommended over posterior mean or mode - though, of course, the method could be 
implemented with any CRM allocation rule. 

5. Usually, if the two allocations differ then x'^^f^^ = x^^^^^^ is. However, occasionally 
the U&D and CRM allocations point in different directions (i.e., x[^f^^ = Xi + 
SjX^^f^^ = Xi — s OT vice versa). In that case we have two options: 

(a) Proceed as in the usual case, i.e. use the CRM allocation whenever the U&D 
one is outside the calculated credible interval; 

(b) If the U&D allocation is "rejected" , use the CRM allocation only if 

Xi^ Qp^p - s/2,Qp^i_p + s/2 
as well. Otherwise Xj+i = Xj. 

6. For toxicity-averse applications such as Phase I, the design can be modified to have 
two different (3 values on the two tails. If we desire to minimize upward excursions, 
then the right-hand (3 should be larger. 



^This exclusion is ensured by adding the s/2 buffers on each side of the Cr.L; the method could be 
implemented without them, though this is not advisable; see note 2. 
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Lifelong (or converted) Bayesians would now ask, "what is the loss function?" First 
we should note that in spite of the use of Bayesian calculations, our framework is hybrid: 
one which evaluates a non-Bayesian procedure using Bayesian methods, but which gives the 
non-Bayesian decision a "home-court advantage" . Therefore, it is not clear that formulating 
a loss function provides added value, being mostly an exercise in reverse engineering. At 
this moment, I have no satisfactory formulation for the loss function. 

Because of C-BUD's greater theoretical appeal and flexibility, from this point on we 
limit discussion of BUD to this variant and leave the randomized R-BUD behind. From 
now on, referring to BUD will imply C-BUD, unless otherwise noted. 



5.3.4 Estimation and Stopping Rules 

In view of CRM's coherence advantage, one is tempted to use the n + 1-th CRM allocation 
as the BUD estimate. However, this is somewhat self-contradictory. After all, the BUD 
framework is about not accepting the CRM decision at face value; why, then use it as 
the final estimate? Moreover, the CRM allocation model is knowingly misspecified. If we 
use a model-based estimation, it better be the most realistic model we can suggest for the 
problem, rather than the CRM model. An alternative option is to use U&D estimators - 
either vad, which can be applied to any chain (not necessarily a purely U&D one), or CIR. 

The BUD framework suggests some interesting stopping rule options. The most straight- 
forward one would be to stop after a specified number of successful CRM overrides (defined 
as allocation that exceeds the required credibility threshold, whether or not it coincides with 
the U&D allocation), indicating that the model is fitting reasonably well around target. This 
subject is left for subsequent research. 



5.3.5 CCD-BUD Combination 



The frequentist interpretation suggests a way to combine CCD (jivanova et al.l . 120071 ) with 
U&D. We recall that CCD forces a transition whenever F{xi) is outside a pre-specified 
tolerance window around p. This is still CRM-style estimation-driven allocation, but it is 
somewhat less prone to "locking" than CRM (see Fig. 14. 5p . Additionally, since it has no 
prior and no curve assumptions, it is more robust to misspecification and to boundaries. 
In order to combine it with U&D, we use frequentist binomial confidence intervals for Qp, 
instead of the credible intervals. 
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5.3.6 Numerical Examples 

As a demonstration of the pros and cons of CRM and BUD, I present graphically some 
summaries from numerical simulations. The first (Figs. 15. H 15. 2p is of a one-trial-at-a-time 
experiment with m = 8; the second (Figs. 15.31 15. 4[) is of a cohort {k = 2) experiment with 
m = 6 and a low starting point, a la Phase I. Shown are estimation "success" probabilities, 
with "success" declared whenever the level closest to Qp is also the level closest to Qp (a 
rather standard measure of success in Phase I literature). The most salient observation is 
that success is very far from being assured; indeed, even after 30 — 40 trials, expecting 60% 
success is a bit optimistic under both frameworks. 

In the first example, the focus was on inspecting both BUD (color-coded to blue hues, 
(3 = 0.15 in all cases) and various CRM models (shades of red and pink). Here I refrained 
from simulating conditions with target on the boundary, which is equivalent to assuming 
that the design did not really have a physical boundary (i.e., treatments could in principle be 
increased or decreased beyond the nominal boundaries). The three CRM models differ wildly 
from each other in their performance pattern: one-parameter "power" tends to strongly 
over- or under-perform, while the two-parameter logistic seems most robust, but overall 
its success rate is relatively low, and seems to improve more slowly than others over time 
(compare the top figure for n = 20 with the bottom, n = 40). The BUD models track 
pretty closely to U&D at n = 20, but post visible gains over it at n = 40. Interestingly, the 
one-parameter BUD combination appears to be most robust. 

From the same simulations. Fig. 15.21 demonstrates convergence uniformity or robustness 
to "unlucky batches" , as in Fig. 14.61 For BUD runs, pu* is almost as tightly distributed 
around its mean as for U&D runs - but the peak shifts (sometimes almost imperceptibly) 
to the right. The better the model fit, the more substantial the rightward shift. The risk of 
very bad runs (very low p^*) only slightly increases compared to U&D. 

In the second numerical example I retained only the "power" model, since the focus was 
on Phase I style constraints, with smaller m (6) and a low starting level {I2). Here (Fig. 15.3]) 
GU&D was color-coded blue, CRM red, and BUD in shades of purple and pink, according 
to the value of (3 (0.15 to 0.35). Additionally, the x— axis was arranged so that the true 
targets are in increasing order. 'Hard', reflecting physical boundaries were assumed, as is 
the case in these applications. This immediately suggests using the boundary-resistant CIR 
for U&D and BUD estimation. 

A clear pattern emerges, with CRM holding the edge when targets are at or above the 
center of the treatment range, and U&D better for targets lower than center or on the 
boundaries. The explanation is quite mundane: CRM's prior has a (broad, but distinct) 
peak on levels 3 — 4. Hence, it has a forfeit advantage if these are indeed the targets, and 
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BUD Performance Across Distributions, 40 trials 

beta=0.15 for all models; 8 levels In all runs 
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Figure 5.1: Comparison of U&D (KR, k = 2), CRM and BUD (/3 = 0.15) estimation on a 
variety of simulated distributions, and n = 20 (top) and 40 (bottom). "Correct level" was 
defined as the level whose F value is closest to 0.3. The design had 8 levels, with all runs 
starting at level 4. The 'power' model is logistic with shape parameter and fixed location 
and scale (d = 1), introduced by O'Quigley et al. ( 199d ). The other two models have d = 2: 
'Logistic' is location-scale, and 'Weibuii' is shape-scale. The auto-detect averaging estimator 
VAD was used for U&D and BUD estimation. 
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Figure 5.2: Uniformity of U&D and BUD (/3 = 0.15) allocations, shown via histograms 
of pu* from 500 runs, after 40 trials. Simulations were non-cohort (or cohort size 1), with 
normal (top) and gamma, shape 1.2 (bottom) thresholds. In both cases m = 8. These 
are the same scenarios depicted in Fig. 14.61 Each frame shows histograms of U&D runs 
(top left-hand corner), vs. BUD combined with the one-parameter "power" model (bottom 
left-hand corner), two-parameter logistic (top right-hand corner) and shape-scale Weibull 
(bottom right-hand corner). 
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also to level 5 (which U&D has to climb 3 levels in order to reach). On the boundaries, CRM 
suffers from having a low prior weight placed there (which is intuitively natural; otherwise, 
the boundaries would be wider). 

How does BUD fare in these simulations? It seems that the design most exposed to CRM 
override (/? = 0.35) is not as robust as smaller exposures, without gaining the advantages of 
full CRM. Smaller values of (3 seem to remain well within the performance envelope created 
by U&D and CRM, regardless of who's on top. Overall, they track closer to U&D which is 
understandable given the small n, and perform better on the average than either U&D or 
CRM. 

Fig. 15.41 summarizes simulation results under the same conditions, but comparing U&D 
to CCD and CCD-BUD with (3 = 0.25. Note that the vertical scale is smaller here, because 
all designs are substantially more robust than CRM in terms of estimation success. CCD 
{k = 2), in particular, is the most robust design I've observed from that respect§| At n = 18, 
there are very few overrides with (3 = 0.25, and as a result CCD-BUD tracks very closely 
to U&D. Later on some differences emerge, and again CCD-BUD seems to perform, on the 
average, better than either U&D or CCD. 



^This does not indicate all-around robustness; in terms of run-to-run variability discussed on Section 
14.3.11 CCD is almost as variable as CRM and much less robust than U&D or BUD. See Fig. g??] 
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Performance Across Distributions, 18 trials 
Cohort Design (coliort size 2), 6 Levels, Isot. Reg. Estimates 
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Figure 5.3: Comparison of U&D (GU&D(2,o,i)), CRM (with cohorts of size 2) and BUD 
{f3 = 0.15,0.25,0.35) designs on a variety of simulated distributions, and n = 18 (top) and 
32 (bottom). 'Correct level' was defined as the level whose F value is closest to 0.3. The 
design had 6 levels, with all runs starting at level 2. Only the 'power' model {d = 1) was 
used for CRM and BUD. The CIR estimator was used for U&D and BUD. Distributions 
are arranged by target location, so that the leftmost distribution has level 1 as the optimal 
level, and the rightmost distribution has level 6. 
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Performance Across Distributions, 18 trials 
Coliort Design (coliort size 2), 6 Levels, Isot. Reg. Estimates 
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Figure 5.4: Comparison of U&D (GU&D(2,o,i)), CCD (with cohorts of size 2) and CCD-BUD 
(/3 = 0.25) designs on the same thresholds and under the same conditions as in Fig. 15. 3i 
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Chapter 6 

CONCLUSIONS AND FUTURE DIRECTIONS 
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Percentile- Finding: Limitations and Potential 

Revisiting the framework outlined in the introduction, we now have a better grasp of the 
problem's limitations and of the way they play out in practice. As the anesthesiology 
experiment demonstrated (Section 3.4), the variability encountered with live subjects is 
usually much greater and less tractable than when using a simplistic functional form for F, 
and experimental runs are never as smooth as computer simulation output. 

Considering the two main solution paths currently available (U&D and Bayesian), the 
choice is between quick convergence, albeit to a random walk that is by definition prone 
to excursions; and a design that locks onto a single level, without guarantee that it is in 
fact the best one. In the longer run, as far as estimation precision is concerned the two 
approaches would usually agree; however, the sequence of treatment allocations during the 
early phase is almost as different as it gets. 

Since the settings are those of an experiment, even if toxicity to live subjects is involved 
the decision whether to take a risk over the allotted sample has already been taken. Hence, 
placing "allocation purity" above all other considerations appears to be detrimental to the 
goal of controlling toxicity over a larger population (thanks to Barry Storer for emphasizing 
this point). Moreover, as Section 4.3 demonstrated, CRM's attempt to identify the best 
level early on may actually result in increased risk compared with U&D. 

On the other hand, the list of "best in class" U&D designs is short and imposes prac- 
tical constraints. For example, the fastest-converging cohort design targeting Q0.3 has 
a cohort size of 2; the nex t fast-converging option is with cohort size 5 (GU&:D(5 i 2); 



Gezmu and Flournovl (|2006l )). Designs with A; = 3, the most popular cohort size for Phase 
I studies, are much slower, similarly, for single-trial experiments the best design (KR) can 
approximate Qo.3,Qo.2 or smaller percentiles (somewhat underestimating the former and 
overestimating the latter) - but cannot work for other below- median targets such as Qqa- 
Another constraint has to do with boundaries, whose effect must be avoided or mitigated 
in one of the ways outlined on Chapter 3. 

CRM and its nonparametric offshoots (CCD, etc.) are much less constrained in this 
respect. They perform best (besides the more detailed conditions outlined in Chapter 4) 
when the spacing is coarse so that AF between levels is around 0.2 or more. Still, given 
their large small-sample variability, I would not recommend using an exclusive CRM-type 
design. 

If one plans a highly constrained experiment (i.e., with constraints that hamper U&D 
performance), a BUD-type hybrid design offers less risk than CRM/CCD alone. Without 
these constraints, "best in class" U&D should suffice, possibly with a down-shift scheme 
to smaller spacing midway through the experiment. The finer points of further optimizing 
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U&D remain to be explored. 

Centered Isotonic Regression 

Rather early in my work I accidentally discovered the fix described here as centered isotonic 
regression (CIR). It immediately showed a clear performance improvement; however, its 
theoretical justification and properties were not as clear. The theoretical treatment pre- 
sented in Sections 3.2 and 3.3 has been developed gradually since then (aided in part by 
Marloes Maathuis) . Many other ideas and directions related to CIR were not discussed here. 
This estimator may have applicability beyond U&D, perhaps as yet another modified-MLE 
estimator, to be almost automatically preferred over the MLE in ordinary settings due to its 
practical advantage (such as using rather than for normal sample variance estimates). 

Tailoring CIR to other applications may mandate different solutions for finding the 
optimal x-coordinate of pooled estimates. This is probably true with binary current-status 
data. Another interesting perspective is viewing CIR as a form of monotone linear splines, 
for which the node location is determined automatically. The latter idea may be extended 
to higher-order splines. 
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Appendix A: Proof of KR Unimodality 

In principle this appendix is redundant, but for some reason the question of KR's mode 
has become controversial and resulted in at least two article rejections - one by Gezmu and 
Flournoy and one by yours truly. Therefore, a detailed derivation proving that KR has a 
single stationary mode follows. 

As indicated in Theorem [H the proof looks at the stationary profile 7 as a continuous 
function of F. Since is a continuous strictly increasing function of x, it suffices to show 
that 7(-F) is monotone decreasing. We simplify equation (|2.9|) to the form 




(1) 



where = F{x + s). The derivative 'j' (F) has the same sign as 



F+ {(1 - + kF{l - + F'_^F{1 - Ff^ 

-F+ { (1 - + kF [(1 - + (1 - } - F;F(1 - F f. 



Canceling terms and dividing through by -F+(l — F) 



k-l 



we get 



l-F+^F{l-F)^+^-{ 



(1 - F)'=+^ -kF- 



(3) 



< 1 - F - (1 - - kF. 



This expression is zero at F = 0, and has derivative {k + 1) [(1 - F)^ - l] < for F > 0. 
Therefore it is negative for F > 0, and by extension 7(x) is decreasing. Therefore, by 
Lemma [H KR has a single stationary mode as claimed in Theorem [TJ 
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Appendix B: Code for The New Estimators 

Code for Auto-Detect Estimator 

Note that the AD estimator can be used not only for U&D chains, but for any vector 
that has a starting-point effect and eventually converges to some stationary behavior. In 
that latter case, however, care must be taken in interval estimation. One should use the 
autocorrelation-based estimate of rte// (sel in the code) and not the "hitting time" estimate 
which is based on U&D theory. 

ADmean<-f unction (x.bef ore=T,saf e=length(x)/4,full=T) { 

#### REMOVING 'FRONT TAIL' AND RETURNING MEAN OF REMAINING VALUES 

# Assaf Oron, 10/2007 

# ARGUMENT LIST 

# x: The vector to be averaged 

# before: if true (default), averaging begins from one position before the 

# identified trainsition point 

# safe: a position beyond which the algorithm stops looking. This means that 

# at least length(x) -saf e+1 elements will be included in averaging. Defaults 

# to length(x)/4. 

# full: if true, output includes information needed for for interval estimation. 

# Otherwise, only point estimate is returned. 

cutof f =round(scif e)+l 
n=length(x) 

base=sign(x [1] -meaii(x [2 :n] ) ) 
a=2 

while (a<cutoff & sign(x[a]-meaii(x[(a+l) :n] )) != -(base)) a=a+l 
if (before) a=a-l 

if (Ifull) return (outmean) ### POINT ESTIMATE ONLY 
#### NOW THE TOUGH PART - VARIANCE ESTIMATION 

#### WE BASE IT ON THE SD AND TWO ESTIMATES OF EFFECTIVE SAMPLE SIZE 
#### IN ANY CASE, WE USE ONLY THE AVERAGED PART OF x 

### THE FIRST IS AUTOCORRELATION BASED; WE LOOK AT 1ST AND 2ND DEGREE 

xx=x [a : n] 
nn=length(xx) 

corl=max(cor(xx[l: (nn-1)] ,xx[2:nn] ) ,0) 
cor2=mcLx(cor (xx [1 : (mi-2)] ,xx[3:mi] ) ,0) 
corf ac=max(corl , sqrt (cor2) ) 



### THE FOLLOWING SAMPLE SIZE ESTIMATE IS BASED ON DERIVATION IN THESIS 
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eff _nl=max(iui*(l-corf ac)/ (1+corfac) ,1) 

sig2=var(xx) 
sel=sqrt (sig2/ef f _iil) 

############ 

### NOW, CALCULATING S.E. VIA INTERVALS BETWEEN HITTING TIMES 
### AT A GIVEN LEVEL (MARKOV CHAIN THEORY) 

f irsts=c (if else (a==l , 1 , if else (x [a] ==x [a-1] ,0,1)) ,dif f (xx) ) 

tmp=table (xx [firsts ! =0] ) 

iiiiq=sort (imique (xx [firsts ! =0] ) ) 

dists=sort (abs (unq-mean(xx) ) , index . return=T) 

### The effective i.i.d. sample size is the mecin number of 

### hitting intervals for the two levels closest to our estimate 

ef f _n2=tmp [dists$ix [1] ] -1 
closest=imq[dists$ix[l] ] 

piececut=ifelse(firsts!=0 & xx==closest , 1 ,0) 

#cat (xx, ' \n' ) 

#cat (piececut , ' \n ' ) 

pieceid=cumsum(piececut) 

piecelens=sapply (split (xx,pieceid) , length) 

#### THIS, TOO, IS BASED ON A DERIVATION FOUND IN THE THESIS 
se2=sqrt (ef f _n2* (mean(piecelens~2)*(sig2+(outinecin-closest) ~2) 
+max(outmean, closest) "2*var (piecelens) )/nn"2) 

####### NOW WE TAKE THE MORE CONSERVATIVE S.E. ESTIMATE; BOTH WILL BE RETURNED ANYWAY 
se=max(sel , se2) 

#### THE D.F. ESTIMATE FOR T-STAT IS TAKEN FROM THE HITTING TIME APPROACH 
df=max(eff_n2-l,l) 

return ( c (outmean , se , df , a , sel , se2 , ef f _nl , ef f _n2) ) 
} 

Code for CIR Estimator 

First, a generic code for forward estimation (point estimate only). This code is completely 
analogous to the well-known PAVA code; in fact, parts of the R pavaO code are copied and 
pasted here. 

cir.pava <- function (y,x, wt=rep(l , length(x) ) ,boundary=2,f ull=FALSE) { 

# Returns centered-isotonic-regression y values at original x points # 

# Assaf Oron, 10/2007 
# 
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### ARGUMENTS: 

# y: y values (raw point estimates). Must match the x values. 

# They are given as first argument, both for compatibility with 'pava' 

# and to enable parallel running via 'apply' type routines 

# x: treatments. Meed to be pre-sorted in increasing order 

# wt : weights . 

# boundary: action on boundaries. Defaults to 2, 

# which is cinalogous to 'rule=2' on function 'approx', i.e. 

# returned y values are constant outside the boundaries. 

# boundary=l does linear extrapolation. 

# In addition, one can impose boundaries as inputs or 

# augment the output with boundaries, as discussed in 

# the dissertation text. 

# full: if FALSE, only point estimates at x values are returned 

### Validation stuff 

n <- length (x) 

if (n <= 1) { 

if (Ifull) return (y) 

else return(list (x=x,y=y ,z=x) ) 

} 

if (any (is .na(x) ) || £Lny(is.na(y))) { 

stop ("Missing values in 'x' or 'y' not allowed") } 
if (any(dif f (x)<=0) ) {stop ("x must be strictly increasing")} 

z<-x # Keeping a 'clean' copy of x for final stage 

Ivlsets <- (l:n) 
repeat i 

viol <- (as. vector (diff(y)) <= 0) # Find adjacent violators 
if ( ! (einy (viol) ) ) break 

i <- min( (1 : (n-1) ) [viol] ) # Pool first pair of violators 

y[i] <- (y[i]*wt[i]+y[i+l]*wt[i+l]) / (wt [i] +wt [i+1] ) 

x[i] <- (x[i] *wt [i]+x[i+l] *wt [i+1] ) / (wt [i] +wt [i+1] ) # new x is calculated 
wt [i] <-wt [i] +wt [i+1] # weights are combined 

# Deleting the i-l-th element 

y<-y[-(i+l)] 
x<-x[-(i+l)] 
wt<-wt [-(i+1)] 
n <- length (y) 
if (n <= 1) break 

} 

if (boundary==l) { 

### Utilize this option if you wish to use linear extrapolation 

### outside the boundary 

### (the 'approx' function does not have this option) 
### In general, this is *not* recommended; 



162 



### rather, impose boundary conditions whenever possible 
### (as inputs or after output of this function) 
### or use the default, constant boundary conditions 

if (x [n] <inax(z) ) { 

x<-c(x,max(z) ) 

y<-c (y , y [n] + (y [n] -y [n-1] ) * (x [n+1] -x [n] ) / (x [n] -x [n-1] ) ) } 
if (x[l]>min(z)) { 

x<-c (min(z) ,x) 

y<-c (y [1] - (y [2] -y [1] ) * (x [2] -x [1] ) / (x [3] -x [2] ) , y) } 

} 

# Now we re-interpolate to original x values, stored in z 

# If we didn't set boundary=l, then this will give constant 

# y values for x values falling outside new range of x 

if (Ifull) return (approx(x,y,z,rule=2)$y) 

else return(list (x=x,y=y ,z=z,wt=wt) ) 
} 

Now, inverse estimation specifically tailored for the U&D application. The input here is 
as a yes-no summary tabic of the binary responses. The output is a vector, with the first 
element an inverse point estimate of Qp, and the other elements percentile estimates of Qp 
for percentiles given by the user (for interval estimation). 

This code includes interval estimation as discussed in the text - again, tailored for U&D 

in which treatment allocation is random. The same principles can be applied to generate 

interval estimates for non-random allocations. 

The code uses some small utilities that follow below. 

cir .upndown<-f unction (yesno jXseq, target ,xbounds=c(0, 1) ,ybounds=c(0, 1) , 
full=F,cioption="poisson",plist=c(.025, .975)) { 

# Centered-isotonic-regression for Up-and-Down 

# Assaf Oron, 10/2007 

# Returns Point & Interval Estimates of Target percentile 
### ARGUMENTS: 

# yesno: Yes-no table of binary responses. Can contain rows of zeros 

# xseq: Treatment values matched to the responses 

# target: The target response rate (between and 1) 

# xbounds , ybounds : used for interpolation in case (estimated) 

# target falls outside CIR output boundaries. 

# full: complete output or estimates only 

# cioption: which method to use for interval estimation 

# can choose from 'poisson' , 't' or binomial (any other string) 

# plist: percentile list for interval estimation. Is a vector 

# of any length 



n_m<-yesno [ , 1] +yesno [ , 2] 
x<-xseq[n_m>0] 

y<-yesno [n_ni>0 , 1] /n_in [n_m>0] 

xbounds [1] <-min(xbounds [1] ,miii(x) ) 
xbounds [2] <-mcLX (xbounds [2] ,max(x) ) 

### Point Estimate ########################### 

# We start via forward estimation, just calling 'cir.pava' 

pavout <-cir . pava (y , x , wt=n_m [n_m>0] , f ull=TRUE) 

newx<-pavout$x 
newy<-pavout$y 
newn=pavout$wt 

### Error control if the interpolation target is outside boundary 
if (min(newy) >target) { 

newx<-c (xbounds [1] ,pavout$x) 

newy=c(ybounds [1] ,pavout$y) 

newy=c(ybounds [1] ,pavout$y) 

newn<-c (1 , pavout$wt) 

} 

mm=length (newy ) 

if (max (newy) <target) { 

newx<-c (newx , xboimds [2] ) 

newy=c (newy , ybounds [2] ) 

newn=c (newn , 1) 

> 

### The estimate is generated by using 'approx' with x and y intercheinged 

out=if else (length (newy) ==1 , newx , approx (x=newy , y=newx , 
xout=target ,ties="ordered" ,rule=2)$y) 

### Now to Interval Estimate ########################### 

# We find the point estimate's location w.r.t. our grid 

yplace=min (max (rank (c (target , newy) .ties .method="max") [1] -1 , 1) , length (newy) 
minn=min(newn[yplace] ,newn [yplace+1] ,na.rm=T) 

if (length (newy) ==1 II newy [yplace+1] ==newy [yplace] ) { 
### degenerate cases (error control) 

cigaps=if else (plist<0. 5, xbounds [1] -xbounds [2] .xbounds [2] -xbounds [1] ) 

# } 

y else { 
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yplratio= (target-newy [yplace] ) / (newy [yplace+1] -newy [yplace] ) 

if (cioption=='t ' ) { 

width=siiioothqt (plist .iiiinn, target) /minn 

} else if (cioption=='poisson' ) { 

width=smoothqpois(plist ,minn, target )/minn-target 

} else width=sinoothqbinom(plist,ininn,target)/iniiin-target 

### Inverting the forward interval estimates 

cigaps=width* (newx [yplace+1] -newx [yplace] ) / (newy [yplace+1] -newy [yplace] ) 

} 

#### Output 

if (Ifull) { returnCc (out , out+cigaps) ) 

} else return (list (raw=yesno ,paved=pavout , out=out , ci=out+cigaps) ) 
} 

Finally, the utilities used for interval estimation. 

sinoothqbinom=f unction (p , size , prob , add=T , half =F) { 

ql=qbinoin (p , size , prob) 
pl=pbinoin(ql , size , prob) 

### We smooth out the binomial quantile function, because n is random 

p2=pbinom(ql-l , size, prob) 
out=ql- (pl-p) / (pl-p2) 
if(add==T) { 

out [p>0 . 5] =out [p>0 . 5] +2 

out [p<0 . 5] =out [p<0 . 5] -1 

} 

if (half==T) out=out-0.5 

#### This part to ensure we are still conservative compared with 
#### the traditional queintile function 

out [p<0 . 5] =if else (out [p<0 . 5] <ql [p<0 . 5] , out [p<0 . 5] , ql [p<0 . 5] ) 
out [p>0 . 5] =if else (out [p>0 . 5] >ql [p>0 . 5] , out [p>0 . 5] , ql [p>0 . 5] ) 

out 

} 

#### 

smoothqt=f unction (p, size, prob) { qt(p,df=size-l)*sqrt(size*prob*(l-prob)) } 
#### 
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smoothqpois=f unction (p , size ,prob , add=F) { 

ref p=c (0.5, if else (p<0 . 5 , 1-p ,p) ) 
ql=qpois (ref p , leimbda=size*prob) 
pl=ppois (ql , lambda=size*prob) 
p2=ppois (ql-1 , lambda=size*prob) 

### We smooth out the Poisson quantile function, because n is random 
out=ql-(pl-refp)/ (pl-p2) 
extra=if else (add==T ,1,0) 

if else (p>0 . 5 , out [2 : length(out ) ] +extra, 2*out [1] -out [2 : length(out ) ] -extra) + 
size*prob-out [1] 



} 
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Appendix C: Some Technical Details of CRM and BUD simulations 

A wider range of CRM models than that which appears in Chapter [J] figures was simulated 
a nd contemplated durin g my work. In the end, I presented results from only three models 



Q'Quigley et alJ (jl99Cll )'s original one-parameter model, location-scale logistic and shape- 



scale Weibull - for the following reasons: 

• Only these three have been used in practice a reasonable number of times (the Weibull 
model is popular in sensory studies' QUEST) 

• Chapter Hfs theoretical results indicate little benefit from models with more than 2-3 
parameters; 

• Simulation running times increase exponentially with the number of parameters, as 
will be detailed below. 

The implementation involves calculation of posterior statistics. In terms of both com- 
puting time and precision, the worst-case default is MCMC. This is reasonably feasible 
option for a single actual experiment. However, in simulation (or sensitivity analysis) the 
calculation time could be prohibitive: for each single run, one has to simulate quite a few 
MCMC cycles (it seemed that no fewer than 20,000 cycles are usually needed), for each 
single trial. Moreover, as stated we are only interested in a handful of statistics at most 
- so having the complete posterior on our hands (as MCMC provides) is quite an overkill. 
Therefore, researchers in this field naturally look for shortcut s. Computationally the easi - 



est solution is to find a closed-form expression. For example, iGasparini and Eiseld (|200d ) 
simulated a saturated model using a "product of betas" prior that enabled them to find a 
closed-form expression for the posterior mean. 

The next-fastest solution is numerical integration. The posterior mean, either of 9, of 
{Gm} or of Qp, can be easily calculated using a ratio of two integrals. Posterior percentiles 
of Qp can also be found this way - by converting the likelihood equation so that Qp is one 
of the parameters, and integrating it over intervals (or parameter-space slices) of the form 
[lu — s/2,lu + s/2]. Numerical integration is typically more accurate than MCMC (tested 
over known prior distributions). The computing time increases exponentially with d, the 
number of parameters. For d = 1, a calculation that would take days using MCMC takes 
only minutes via R's integrate () function. For d = 2, it takes hours using adapt () - 
which is still a factor of 5 — 10 shorter than MCMC. For d > 3 the integration approach 



^Here I implicitly count the "power" model used nowadays as a generalization of lO'Quiglev et al] (|l990l )'s 
one-parameter model. 



167 



becomes more cumbersome symbolically, less stable numerically, and may in fact take more 
time than MCMC. 

In implementing BUD, I cut a few corners - while the CRM allocation was usually 
calculated using the "closest response" criterion mandated by theory, I calculated posterior 
percentiles (to determine override) using the "closest treatment" criterion, because this 
could be done via numerical integration rather than MCMC. Sensitivity analysis showed 
that the difference from doing a "closest response" calculation are quite small, and since 
the focus was on a proof of concept the extra computing expense was not worthwhile. 
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callous to human life, brutal, mentally destabilizing, humiliating and corrupting. 
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^About half a million Polish and Ulcrainian Jews were murdered in Belzec during its single year of operation, with 
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■^At the time, the "Silicon Valley" was not so multicultural, and teaching Americans how to pronounce his name 
was a mission beyond Assaf's capabilities. He settled for whatever they came up with, which included "Ozzie". 

■^His family arrived with him; Orna is currently making pottery and selling it at Pike Place Market. 



